Title | Morphology Based Automatic Acquisition of Large-coverage Lexica |
Author(s) |
Lionel Clément, Benoît Sagot, Bernard Lang
INRIA - Institut National de Recherche en Informatique et en Automatique - Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay, France |
Session | P20-W |
Abstract | In this article, we introduce a new technique for constructing wide-coverage morphological lexica from large corpora and morphological knowledge, with an application to French. It relies on the idea that the existence of a hypothetical lemma can be inferred when several different words found in the corpus are best interpreted as morphological variants of that lemma. We first validated our technique by extracting verbs and adjectives from a general French corpus of 25 million words. Compared with other lexical resources available for French, our results are very satisfactory: we cover many words, often derived words, that are not always present in other lexica. Applying our algorithm to the acquisition of domain-specific adjectives from a botanical corpus also gave very good results, demonstrating its usefulness for extracting domain-specific lexica. Moreover, the technique can be generalized to any language with substantial morphology. Part of the resulting lexicon (currently verbal forms) is already freely available at http://www.lefff.net/. |
Keyword(s) | Morphological lexica, Automatic acquisition, Corpus statistics |
Language(s) | French |
Full Paper | 711.pdf |
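
The core idea in the abstract, that a hypothetical lemma becomes plausible when several distinct corpus words can be read as its morphological variants, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the ending list `ER_VERB_ENDINGS`, the function name `hypothesize_lemmas`, and the simple `min_support` counting threshold are all assumptions made for illustration, standing in for whatever morphological description and ranking the paper actually uses.

```python
from collections import defaultdict

# Hypothetical toy paradigm: a handful of endings of French
# first-group (-er) verbs. A real system would use a full
# morphological description of the language.
ER_VERB_ENDINGS = ["er", "e", "es", "ons", "ez", "ent", "ait", "ais"]


def hypothesize_lemmas(word_forms, endings=ER_VERB_ENDINGS, min_support=3):
    """Propose -er verb lemmas supported by several attested forms.

    Each corpus form that ends in a known ending votes for the lemma
    obtained by replacing that ending with "er". A lemma is kept only
    if at least `min_support` distinct forms vote for it (an assumed,
    deliberately crude criterion).
    """
    support = defaultdict(set)
    for form in set(word_forms):
        for ending in endings:
            # Require a stem of at least two characters.
            if form.endswith(ending) and len(form) > len(ending) + 1:
                stem = form[: -len(ending)]
                support[stem + "er"].add(form)
    return {
        lemma: sorted(forms)
        for lemma, forms in support.items()
        if len(forms) >= min_support
    }


corpus = ["chante", "chantons", "chantez", "chanter", "maison", "maisons"]
print(hypothesize_lemmas(corpus))
```

With this toy corpus, four distinct forms support the lemma "chanter", while the spurious candidate "maiser" (suggested only by "maisons") falls below the threshold and is discarded.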