LREC 2000, 2nd International Conference on Language Resources & Evaluation
Conference Papers and Abstracts
Use of Greek and Latin Forms for Term Detection | It is well known that many languages make use of neo-classical compounds, and that domains with a very long tradition, such as medicine, make intense use of such morphemes. This phenomenon has been studied extensively for different languages, with the common finding that a relatively small number of morphemes allows a high number of specialised terms to be detected. We believe that such morphological knowledge can help a term detector discover highly specialised terms. In this paper we propose a module, to be included in a term extractor, devoted specifically to detecting terms that contain neo-classical compounds. We describe this module as well as the results obtained with it.
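As an illustration only (the combining forms listed here are invented for the example and are not the paper's inventory), a detector along these lines could be sketched in Python as:

    # Minimal sketch of neo-classical compound spotting; the morpheme list is
    # illustrative only, not the set of Greek/Latin forms used in the paper.
    GREEK_LATIN_FORMS = {
        "cardio", "hepat", "gastro", "neuro", "patho",
        "itis", "osis", "ectomy", "ology", "algia",
    }

    def looks_neoclassical(word: str) -> bool:
        """Flag a word as a candidate specialised term if it starts or ends
        with one of the known combining forms."""
        w = word.lower()
        return any(w.startswith(m) or w.endswith(m) for m in GREEK_LATIN_FORMS)

    words = ["gastroenteritis", "table", "neuropathology"]
    print([w for w in words if looks_neoclassical(w)])
    # ['gastroenteritis', 'neuropathology']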
Using a Formal Approach to Evaluate Grammars | In this paper, we present a formal methodological approach to evaluating grammars based on a unified representation. This approach uses two kinds of criteria. The first considers a grammar as a resource enabling the representation of particular aspects of a given language. The second concerns the use of grammars in the development of lingware. The evaluation criteria are defined formally, and we indicate for every criterion how it would be applied.
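A toy sketch of one resource-oriented criterion (coverage over a reference sentence set) might look as follows; the parsing predicate is a hypothetical placeholder, since the paper defines its criteria formally rather than through code:

    # Toy illustration of a coverage criterion: the proportion of reference
    # sentences a grammar can analyse. `parses` is a hypothetical predicate.
    def coverage(grammar, sentences, parses) -> float:
        analysed = sum(1 for s in sentences if parses(grammar, s))
        return analysed / len(sentences) if sentences else 0.0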
Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging | The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1,000 morpho-syntactic description codes (MSDs), which, after considering some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require unrealistically large training data (hand annotated/validated). Our solution was to design a hidden reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs. The tag differences between these variants are processed by a combiner which chooses the most likely tags. In the end, the tagged text is subjected to a conversion process that maps the tags from the reduced tagset onto the more informative tags of the large tagset. We describe this processing chain and provide a detailed evaluation of the results.
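The final mapping step from reduced tags back to full MSDs can be sketched as below; the tag names, words, mapping and lexicon are hypothetical placeholders, not the authors' actual resources:

    # Sketch of the tiered-tagset idea: tag with a reduced tagset, then recover
    # full MSDs through the lexicon. All data below is invented for illustration.
    MSD_TO_REDUCED = {"Ncmsrn": "Nc", "Ncfsrn": "Nc", "Vmis3s": "Vm"}
    LEXICON = {"codex": {"Ncmsrn"}, "lex": {"Ncfsrn"}}

    def recover_msd(word: str, reduced_tag: str) -> str:
        """Map a reduced tag back to a full MSD; the lexicon resolves cases
        where several MSDs collapse onto the same reduced tag."""
        options = sorted(m for m in LEXICON.get(word, ())
                         if MSD_TO_REDUCED.get(m) == reduced_tag)
        if len(options) == 1:
            return options[0]
        return "|".join(options) if options else reduced_tag

    print(recover_msd("codex", "Nc"))  # Ncmsrn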
Using Few Clues Can Compensate the Small Amount of Resources Available for Word Sense Disambiguation | Word Sense Disambiguation (WSD) is considered one of the most difficult tasks in Natural Language Processing. Probabilistic methods have shown their efficiency in many NLP tasks, but they require a training phase, and very few resources are available for WSD. This paper aims to show how to make the most of size-limited resources in order to partially overcome the knowledge acquisition bottleneck. Experiments are performed within the SENSEVAL test framework to evaluate the advantage of a lemmatized or stemmed context over the original context (inflected forms as observed in the raw text). Then, we measure the precision improvement (about 6%) obtained by looking at the inflected form of the word to be disambiguated. Lastly, we show that it is possible to reduce the ambiguity if the word to be disambiguated has a particular inflected form or occurs as part of a compound.
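A Lesk-style sketch of the lemmatized-versus-inflected comparison might look as follows; the normalisation function and sense glosses are stand-ins, not the SENSEVAL resources or the exact method used in the paper:

    # Score each sense by the overlap between the context and its gloss,
    # computed either on raw tokens or on normalised (e.g. lemmatized) tokens.
    def overlap(context_tokens, gloss_tokens) -> int:
        return len(set(context_tokens) & set(gloss_tokens))

    def disambiguate(context_tokens, sense_glosses, normalise=lambda toks: toks):
        """sense_glosses: dict mapping a sense label to its gloss tokens."""
        ctx = normalise(context_tokens)
        return max(sense_glosses,
                   key=lambda s: overlap(ctx, normalise(sense_glosses[s])))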
Using Lexical Semantic Knowledge from Machine Readable Dictionaries for Domain Independent Language Modelling | Machine Readable Dictionaries (MRDs) have been used in a variety of language processing tasks, including word sense disambiguation, text segmentation, information retrieval and information extraction. In this paper we describe the use of semantic knowledge acquired from an MRD for language modelling in speech recognition applications. A semantic model of language has been derived from the dictionary definitions in order to compute the semantic association between words. The model is capable of capturing latent semantic dependencies between words in texts and reducing language ambiguity by a considerable factor. The results of experiments suggest that the semantic model can improve word recognition rates in “noisy-channel” applications. This research provides evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling.
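One rough way to derive such word-to-word associations from dictionary definitions is sketched below; the co-occurrence count is purely illustrative and much simpler than the model described in the paper:

    # Words that co-occur in a definition (or with its headword) are counted as
    # semantically related; the count serves as a crude association score.
    from collections import defaultdict
    from itertools import combinations

    def build_association(definitions):
        """definitions: dict mapping a headword to the list of words defining it."""
        assoc = defaultdict(int)
        for head, def_words in definitions.items():
            for a, b in combinations(sorted(set(def_words) | {head}), 2):
                assoc[(a, b)] += 1
        return assoc

    def association(assoc, w1, w2) -> int:
        a, b = sorted((w1, w2))
        return assoc.get((a, b), 0)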
Using Machine Learning Methods to Improve Quality of Tagged Corpora and Learning Models | Corpus-based learning methods for natural language processing now provide a reliable way to build systems with good performance. A number of statistical learning models have been proposed and are used in most of the tasks which used to be handled by rule-based systems. As learning systems become competitive with manually constructed systems, both large-scale training corpora and good learning models are of great importance. In this paper, we first argue that the main hindrances to the improvement of corpus-based learning systems are the inconsistencies or errors in the training corpus and the deficiencies of the learning model. We then show that some machine learning methods are useful for effectively identifying erroneous sources in the training corpus. Finally, we discuss how the various types of errors should be handled so as to improve the learning environment.
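A deliberately crude sketch of the error-spotting idea is given below; it flags one-off (word, tag) pairs that contradict a well-attested dominant tag, whereas the paper relies on actual machine learning models rather than this heuristic:

    # Flag tokens whose tag occurs only once for that word while the word's
    # dominant tag is well attested elsewhere in the corpus; such tokens are
    # candidates for annotation errors to be reviewed manually.
    from collections import Counter, defaultdict

    def suspicious_tags(tagged_tokens, min_count=5):
        """tagged_tokens: list of (word, tag) pairs from the training corpus."""
        by_word = defaultdict(Counter)
        for word, tag in tagged_tokens:
            by_word[word][tag] += 1
        flagged = []
        for word, tag in tagged_tokens:
            counts = by_word[word]
            majority, freq = counts.most_common(1)[0]
            if tag != majority and freq >= min_count and counts[tag] == 1:
                flagged.append((word, tag, majority))
        return flagged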