LREC 2000 2nd International Conference on Language Resources & Evaluation
 

Title Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus
Authors Van Eynde Frank (Center for Computational Linguistics, Maria-Theresiastraat 21, 3000 Leuven, Belgium, frank.vaneynde@ccl.kuleuven.ac.be)
Zavrel Jakub (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, zavrel@uia.ua.ac.be)
Daelemans Walter (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, daelem@uia.ua.ac.be)
Keywords Dutch, POS Tagging, Tagger Evaluation, Tagset Design
Session Session WO18 - Morphology in Lexical and Textual Resources
Full Paper 216.ps, 216.pdf
Abstract This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus” and lays out the philosophy behind the high-granularity tagset designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.