LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Morphemic Analysis and Morphological Tagging of Latvian Corpus
Authors Levāne Kristīne (Institute of Mathematics and Computer Science of the University of Latvia, Raina bulvaris 29, LV - 1459, Riga, Latvia, email: kristine@ailab.miii.lu.lv)
Spektors Andrejs (Institute of Mathematics and Computer Science of the University of Latvia, Raina bulvaris 29, LV - 1459, Riga, Latvia, email: aspekt@ailab.mii.lu.lv)
Session Session WP5 - Corpus Tagging
Full Paper 107.ps, 107.pdf
Abstract The Latvian Corpus contains approximately 8 million running words, which is its initial size for investigations using a national corpus. The corpus contains a variety of texts: modern written Latvian, various newspapers, Latvian classical literature, the Bible, Latvian folk beliefs, Latvian folk songs, Latvian fairy tales and others. The methodology and software for SGML tagging were developed by the Artificial Intelligence Laboratory; approximately 3 million running words have been marked up in SGML. The first step was to develop morphemic analysis in co-operation with Dr. B. Kangere from Stockholm University. The first morphological analyzer was developed in 1994 at the Artificial Intelligence Laboratory. The analyzer has its own tag system. Later, the tags for the morphological analyzer were elaborated according to the MULTEXT-EAST recommendations. The Latvian morphological system is rather complicated, and recognizing words and word forms is difficult because Latvian has many homonymous forms. The first morphologically analyzed corpus of texts was marked up manually. In total it covers approximately 10 000 words of modern written Latvian. The results of this work will be used in further investigations.