LREC 2000 - Papers

LREC 2000: 2nd International Conference on Language Resources & Evaluation

Conference Papers

Title Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus

Authors Van Eynde Frank (Center for Computational Linguistics, Maria-Theresiastraat 21, 3000 Leuven, Belgium, frank.vaneynde@ccl.kuleuven.ac.be)
Zavrel Jakub (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, zavrel@uia.ua.ac.be)
Daelemans Walter (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, daelem@uia.ua.ac.be)

Keywords Dutch, POS Tagging, Tagger Evaluation, Tagset Design

Session Session WO18 - Morphology in Lexical and Textual Resources

Abstract This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus” and lays out the philosophy behind the high-granularity tagset designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.
