LREC 2000 2nd International Conference on Language Resources & Evaluation
 

Previous Paper   Next Paper

Title Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
Authors Zavrel Jakub (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, zavrel@uia.ua.ac.be)
Daelemans Walter (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, daelem@uia.ua.ac.be)
Keywords Combining Systems, Machine Learning, Reuse of Resources, Tagging
Session Session WO1 - Corpus Tagging
Full Paper 155.ps, 155.pdf
Abstract This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.