Title |
Rarity of Words in a Language and in a Corpus |
Authors |
Hlaváčová Jaroslava (Institute of the Czech National Corpus, Faculty of Arts, nám. J. Palacha 2, Prague, Czech Republic, jaroslava.hlavacova@ff.cuni.cz) |
Keywords |
Frequency, Language Corpus, Rarity, Reduced Frequency |
Session |
Session WP7 - Corpus Projects |
Full Paper |
295.ps, 295.pdf |
Abstract |
A simple method presented last year (Hlavacova & Rychly, 1999) makes it possible to distinguish automatically between rare and common words that have the same frequency in a language corpus. The method introduces two new terms: reduced frequency and rarity. Rarity was proposed as a measure of how rare or common a word is in a language.
This article examines rarity in more detail. Its value was calculated for several different corpora, and the results were compared. Two experiments were carried out on real data from the Czech National Corpus. The results of the first show that reordering the texts in the corpus does not influence the rarity of words that have a high frequency in the corpus. The second experiment compares the rarity of the same words in two corpora of different sizes. |