LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Integrating Seed Names and ngrams for a Named Entity List and Classifier
Authors Buchholz Sabine (ILK / Computational Linguistics, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands, email: S.Buchholz@kub.nl, http://ilk.kub.nl)
van den Bosch Antal (ILK / Computational Linguistics, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands, email: vdnBosch@kub.nl, http://ilk.kub.nl)
Keywords  
Session Session WO14 - Named Entity Recognition
Full Paper 141.ps, 141.pdf
Abstract We present a method for building a named-entity list and a machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named-entity recognizer, and labeled seed name lists taken from the internet. The seed names, each labeled as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. These 8-gram instances are used to train a memory-based machine learning algorithm that (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%. On free text, named-entity token labeling accuracy is 71%.
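
As a rough illustration of the pipeline described in the abstract, the following Python sketch looks up seed names in a tokenized text, stores their contexts as labeled instances, and labels an unseen name by comparing its context to the stored ones. Several details are assumptions, not taken from the paper: the 8-gram is modeled as four tokens to the left and four to the right of a single-token name, a plain 1-nearest-neighbour classifier with an overlap count stands in for the memory-based learner, and the function names (`context_at`, `extract_instances`, `classify`) and the example sentences are hypothetical.

```python
from collections import Counter

PAD = "_"  # filler for positions that fall outside the sentence


def context_at(tokens, i, left=4, right=4):
    """Return the tokens around position i (the name itself is excluded)."""
    return tuple(
        tokens[j] if 0 <= j < len(tokens) else PAD
        for j in range(i - left, i + right + 1)
        if j != i
    )


def extract_instances(tokens, seeds, left=4, right=4):
    """Store the context of every seed name found in tokens, with its label."""
    return [
        (context_at(tokens, i, left, right), seeds[tok])
        for i, tok in enumerate(tokens)
        if tok in seeds
    ]


def overlap(a, b):
    """Count the positions at which two contexts hold the same token."""
    return sum(x == y for x, y in zip(a, b))


def classify(context, memory, k=1):
    """Label a context by majority vote among its k most similar instances."""
    nearest = sorted(memory, key=lambda inst: -overlap(context, inst[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical seed list and training sentences (assumptions for illustration).
seeds = {"Amsterdam": "LOCATION", "Philips": "ORGANIZATION"}
corpus = [
    "de burgemeester van Amsterdam zei gisteren dat de stad groeit".split(),
    "de directeur van Philips zei gisteren dat het bedrijf groeit".split(),
]
memory = [inst for sent in corpus for inst in extract_instances(sent, seeds)]

# An unseen name is labeled from its context alone.
new_sentence = "de burgemeester van Rotterdam zei vandaag dat de haven groeit".split()
predicted = classify(context_at(new_sentence, 3), memory)
print(predicted)
```

In this sketch the prediction depends only on the surrounding words, which is why a name absent from the seed lists can still receive a label. A realistic system would also need multi-token names and feature weighting, which are left out here.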