Title

Unexpected Productions May Well be Errors

Author(s)

Tylman Ule (1), Kiril Simov (2)

(1) Seminar für Sprachwissenschaft, Universität Tübingen; (2) Linguistic Modelling Laboratory, Bulgarian Academy of Sciences

Session

P19-SW

Abstract

We present a method for detecting annotation errors in treebanks. The method assumes that errors are unexpected small tree fragments. We generate statistics over configurations of these fragments using a standard statistical test. We then use the test results and the characteristics of their distributions as features to classify unseen configurations as likely errors via machine learning. Evaluation shows that the resulting list of error candidates is reliable, independent of corpus size, annotation quality, and target language.

Keyword(s)

error detection, treebanks, manual annotation, language independent, machine learning

Language(s)

Bulgarian, German

Full Paper

483.pdf