LREC 2000 2nd International Conference on Language Resources & Evaluation
Conference Papers
Title | Rarity of Words in a Language and in a Corpus |
Authors | Hlavacova Jaroslava (Institute of the Czech National Corpus, Faculty of Arts, nam. J. Palacha 2, Prague, Czech Republic, jaroslava.hlavacova@ff.cuni.cz) |
Keywords | Frequency, Language Corpus, Rarity, Reduced Frequency |
Session | Session WP7 - Corpus Projects |
Abstract | A simple method presented last year (Hlavacova & Rychly, 1999) automatically distinguishes between rare and common words that have the same frequency in a language corpus. The method introduces two new terms: reduced frequency and rarity. Rarity was proposed as a measure of how rare or common a word is in a language. This article examines rarity in more depth. Its value was calculated for several different corpora and the results were compared. Two experiments were carried out on real data taken from the Czech National Corpus. The results of the first show that reordering the texts in the corpus does not influence the rarity of words with a high frequency in the corpus. The second experiment compares the rarity of the same words in two corpora of different sizes. |