Title |
Integrating Seed Names and ngrams for a Named Entity List and Classifier |
Authors |
Buchholz Sabine (ILK / Computational Linguistics, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands, email: S.Buchholz@kub.nl, http://ilk.kub.nl); van den Bosch Antal (ILK / Computational Linguistics, Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands, email: vdnBosch@kub.nl, http://ilk.kub.nl) |
Keywords |
|
Session |
Session WO14 - Named Entity Recognition |
Full Paper |
141.ps, 141.pdf |
Abstract |
We present a method for building a named-entity list and a machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named-entity recognizer, and labeled seed name lists taken from the internet. The seed names, each labeled as a PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. These instances, which are 8-grams, are used by a memory-based machine learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%. On free text, named-entity token labeling accuracy is 71%. |
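
The pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the 8-gram is taken to be three tokens of left context, the name itself, and four tokens of right context; names are single tokens; and a plain nearest-neighbour vote with an overlap distance stands in for the memory-based learner. The function names, the padding symbol, and the toy sentences are hypothetical and are not taken from the paper.

```python
from collections import Counter

PAD = "_"  # hypothetical padding symbol for positions outside the sentence


def context_at(tokens, i, left=3, right=4):
    """Build an 8-token window around position i (assumed layout: 3 + name + 4)."""
    before = tokens[max(0, i - left):i]
    before = [PAD] * (left - len(before)) + before
    after = tokens[i + 1:i + 1 + right]
    after = after + [PAD] * (right - len(after))
    return tuple(before + [tokens[i]] + after)


def extract_instances(tokens, seeds):
    """Look up seed names in the text and store each context under the seed's label."""
    return [(context_at(tokens, i), seeds[tok])
            for i, tok in enumerate(tokens) if tok in seeds]


def overlap_distance(a, b):
    """Count the positions at which two equally long windows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(context, memory, k=1):
    """Label a window by majority vote among its k nearest stored instances."""
    ranked = sorted(memory, key=lambda inst: overlap_distance(context, inst[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
seeds = {"Jansen": "PERSON", "Tilburg": "LOCATION"}
corpus = "de heer Jansen woont in Tilburg bij de universiteit".split()
memory = extract_instances(corpus, seeds)

# An unseen name in a similar context is labeled by its nearest stored instance.
new_text = "de heer Pietersen woont in Breda bij de kerk".split()
predicted = classify(context_at(new_text, 2), memory)
```

In this toy setting the unseen name shares most of its context with the stored PERSON instance, so it receives that label. A real system would also need a recognizer to find candidate names first, which is the role of the rule-based named-entity recognizer mentioned in the abstract.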