LREC 2000 2nd International Conference on Language Resources & Evaluation
Conference Papers and Abstracts
Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies | This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.
MDWOZ: A Wizard of Oz Environment for Dialog Systems Development | This paper describes MDWOZ, a development environment for spoken dialog systems based on the Wizard of Oz technique, whose main goal is to facilitate data collection (speech signal and dialog-related information) and interaction model building. Both of these tasks can be quite difficult, and such an environment can greatly facilitate them. Due to the modular way in which MDWOZ was implemented, it is possible to reuse parts of it in the final dialog system. The environment provides language-transparent facilities and accessible methods, so that even non-computing specialists can participate in spoken dialog system development. The main features of the environment are presented, together with some test experiments.
Methods and Metrics for the Evaluation of Dictation Systems: a Case Study | This paper describes the practical evaluation of two commercial dictation systems in order to assess the potential usefulness of such technology in the specific context of a translation service translating legal text into Italian. The service suffers at times from heavy workload, lengthy documents and short deadlines. Use of dictation systems accepting continuous speech might improve productivity at these times. Design and execution of the evaluation followed the methodology worked out by the EAGLES Evaluation Working Group. The evaluation therefore also constitutes a test bed application of this methodology.
MHATLex: Lexical Resources for Modelling the French Pronunciation | The aim of this paper is to introduce the lexical resources and environment, called MHATLex, intended for speech and text processing. Particular attention is paid to pronunciation modelling, which can be used in automatic speech processing as well as in phonological/phonetic description of languages. In our paper we introduce a pronunciation model, the MHAT model (Markovian Harmonic Adaptation and Transduction), which copes with free and context-dependent variants. At the same time, we present the MHATLex resources. They include 500,000 inflected forms and tools allowing the generation of various lexicons through phonological tables. Finally, some illustrations of the use of MHATLex in ASR are shown.
Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation | Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability to new domains as well as in their robustness over time. To overcome those limitations, this paper evaluates named entity chunking and classification techniques for Japanese named entity recognition in the context of minimally supervised learning. This experimental evaluation demonstrates that the minimally supervised learning method proposed here improves on the performance of the seed knowledge for named entity chunking and classification. We also investigated the correlation between the performance of minimally supervised learning and the size of the training resources, such as the seed set and the unlabeled training data.
Models of Russian Text/Speech Interactive Databases for Supporting of Scientific, Practical and Cultural Researches | The paper briefly describes the following databases: "Online Sound Archives from St. Petersburg Collections", "Regional Variants of the Russian Speech", and "Multimedia Dictionaries of the Minor Languages of Russia", whose principal feature is built-in support for scientific, practical and cultural research. Although these databases are addressed to researchers engaged mainly in Spoken Language Processing, and their main object is therefore sound, the proposed database ideology and general approach to text/speech data representation and access may be further used to develop various language resources containing text, audio and video data. Such an approach requires special representation of the database material. Thus, all text and sound files should be accompanied by information on their multi-level segmentation, which should allow the user to extract and analyze any segment of text or speech. Each significant segment of the database should be treated as a potential object of investigation and should be supplied with tables of descriptive parameters mirroring its various characteristics. The list of these parameters for all potential objects is open for further extension.
Modern Greek Corpus Taxonomy | The aim of this paper is to explore the way in which different kinds of linguistic variables can be used to discriminate text type in 240 preclassified press texts. The Modern Greek (MG) language, due to its past diglossic status, exhibits extensive variation in written texts across all linguistic levels, which can be exploited in text categorization tasks. The research presented uses Discriminant Function Analysis (DFA) as a text categorization method and explores the way different variable groups contribute to text type discrimination.
Morphemic Analysis and Morphological Tagging of Latvian Corpus | There are approximately 8 million running words in the Latvian Corpus, which is an initial size for investigations using a national corpus. The corpus contains different texts: modern written Latvian, various newspapers, Latvian classical literature, the Bible, Latvian Folk Beliefs, Latvian Folk Songs, Latvian Fairy-tales and others. Methodology and software for SGML tagging were developed by the Artificial Intelligence Laboratory; approximately 3 million running words are marked up in SGML. The first step was to develop morphemic analysis in co-operation with Dr. B. Kangere from Stockholm University. The first morphological analyzer was developed in 1994 at the Artificial Intelligence Laboratory. The analyzer has its own tag system. Later the tags for the morphological analyzer were elaborated according to MULTEXT-EAST recommendations. The Latvian morphological system is rather complicated, and there are many difficulties with the recognition of words and word forms, as Latvian has many homonymous forms. The first corpus of morphologically analysed texts was marked up manually; in total it covers approximately 10,000 words of modern written Latvian. The results of this work will be used in further investigations.
Morphological Tagging to Resolve Morphological Ambiguities | The aim of this paper is to present the advantages of a morphological tagging of English for resolving morphological ambiguities. Such a way of tagging seems to be more efficient because it allows an intensional description of morphological forms, compared with the extensional collection of usual dictionaries. This method has already been tested on French and has given promising results. It is particularly relevant since it both brings to light hidden morphological rules, which are very useful especially for foreign learners, and takes lexical creativity into account. Moreover, this morphological tagging was conceived in relation to the subsequent disambiguation, which is mainly based on local grammars. The purpose is to create a morphological analyser that is easily adaptable and modifiable and that avoids the usual errors of ordinary morphological taggers linked to dictionaries.
Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets | The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing of four different taggers on the Slovene MULTEXT-East corpus containing about 100,000 words and 1,000 different morphosyntactic tags. Results show, first of all, that training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT tri-gram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words and between 54% and 55% on unknown words. Best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset, and accuracies on these features when training on a reduced tagset. Results show that PoS accuracy is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected.
Multilingual Linguistic Resources: From Monolingual Lexicons to Bilingual Interrelated Lexicons | This paper describes a procedure to convert the PAROLE-SIMPLE monolingual lexicons into bilingual interrelated lexicons where each word sense of a given language is linked to the pertinent sense of the right words in one or more target lexicons. At present, the SIMPLE lexicons are monolingual, although the ultimate goal of these harmonised monolingual lexicons is to build multilingual lexical resources. Achieving this goal requires automating the linking among the different senses of the different monolingual lexicons, since producing such multilingual relations by hand would be, like all tasks related to the development of linguistic resources, unaffordable in terms of human resources and time. The system we describe in this paper takes advantage of the SIMPLE model and the SIMPLE-based lexicons so that, in the best case, it can find fully automatically the relevant sense-to-sense correspondences for determining the translational equivalence of two words in two different languages and, in the worst case, it will be able to narrow the set of admissible links between words and relevant senses. This paper also explores to what extent semantic encoding in already existing computational lexicons such as SIMPLE can help in overcoming the problems that arise when using monolingual meaning descriptions for bilingual links, and aims to set the basis for defining a model for adding a bilingual layer to the SIMPLE model. This bilingual layer, based on a bilingual relation model, will in turn be the basis for defining the multilingual language resource that we want the PAROLE-SIMPLE lexicons to become.
Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation | Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success.