LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Rarity of Words in a Language and in a Corpus
Authors Hlaváčová Jaroslava (Institute of the Czech National Corpus, Faculty of Arts, nám. J. Palacha 2, Prague, Czech Republic, jaroslava.hlavacova@ff.cuni.cz)
Keywords Frequency, Language Corpus, Rarity, Reduced Frequency
Session Session WP7 - Corpus Projects
Full Paper 295.ps, 295.pdf
Abstract A simple method presented last year (Hlavacova & Rychly, 1999) automatically distinguishes between rare and common words that have the same frequency in a language corpus. The method relies on two new terms: reduced frequency and rarity. Rarity was proposed as a measure of how rare or common a word is in a language. This article examines rarity in more depth. Its value was calculated for several different corpora and the results were compared. Two experiments were carried out on real data from the Czech National Corpus. The results of the first show that reordering the texts in the corpus does not influence the rarity of words with a high corpus frequency. The second experiment compares the rarity of the same words in two corpora of different sizes.