LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | Developing Guidelines and Ensuring Consistency for Chinese Text Annotation |
Authors | Xia Fei (Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, fxia@linc.cis.upenn.edu); Palmer Martha (Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, mpalmer@linc.cis.upenn.edu); Xue Nianwen (Linguistics Department, University of Delaware, Newark, DE 19716, USA, xueniwen@UDel.Edu); Okurowski Mary Ellen (US Department of Defense, Ft. Meade, MD 20755, USA, meokuro@super.org); Kovarik John (US Department of Defense, Ft. Meade, MD 20755, USA, kovariks@worldnet.att.net); Chiou Fu-Dong (Linguistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA, chioufd@linc.cis.upenn.edu); Huang Shizhe (East Asian Studies Program, Haverford College, Haverford, PA 19041, USA, shuang@haverford.edu); Kroch Tony (Linguistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA, kroch@linc.cis.upenn.edu); Marcus Mitch (Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, mitch@linc.cis.upenn.edu) |
Keywords | Annotation Guidelines, Bracketed Corpus (Treebank), Chinese Language Processing, Quality Control |
Session | Session WO1 - Corpus Tagging |
Full Paper | 287.ps, 287.pdf |
Abstract | With growing interest in Chinese Language Processing, numerous NLP tools for Chinese (e.g. word segmenters, part-of-speech taggers, and parsers) have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets, and bracketing guidelines, which makes comparisons among them difficult. As a first step towards addressing this issue, we have been preparing a 100-thousand-word bracketed corpus since late 1998 and plan to release it to the public in summer 2000. In this paper, we address several challenges in building the corpus, namely creating annotation guidelines, ensuring annotation accuracy, and maintaining a high level of community involvement. |