LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | Principled Hidden Tagset Design for Tiered Tagging of Hungarian |
Authors | Tufiş Dan (RACAI-Romanian Academy 13, “13 Septembrie”, Ro-74311, Bucharest 5, Romania, email:tufis@valhalla.racai.ro) Dienes Péter (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, dienes@nytud.hu) Oravecz Csaba (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, oravecz@nytud.hu) Váradi Tamás (Linguistics Institute, Hungarian Academy of Sciences, H-1014 Budapest Színház u 5-9, varadi@nytud.hu) |
Keywords | Corpus Annotation, Tagset Design, Tagset Reduction, Tiered Tagging |
Session | Session WO18 - Morphology in Lexical and Textual Resources |
Full Paper | 249.ps, 249.pdf |
Abstract | For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or more distinct codes. For the automatic disambiguation of arbitrary written texts, such large tagsets raise many problems, ranging from the implementation issues of making a tagger work with so large a tagset to the more theoretical difficulty of training-data sparseness. Tiered tagging is one way to alleviate this problem by reformulating it as follows: starting from a large set of MSDs, design a reduced tagset, the C-tagset, that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD-set cardinality is several thousand. Designing a manageable C-tagset therefore calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features. |
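
The reduction step described in the abstract can be illustrated with a minimal sketch. It assumes that each MSD is a set of attribute-value pairs and that a C-tag is obtained by keeping only a chosen subset of attributes. The MSD labels, the feature names, and the helpers `reduce_msd` and `build_ctagset` are hypothetical and are not taken from the paper or from the Hungarian lexicon.

```python
from collections import defaultdict

# Hypothetical toy MSDs (label -> feature dict); not taken from the paper's lexicon.
MSDS = {
    "N-sg-nom": {"pos": "N", "number": "sg", "case": "nom"},
    "N-pl-nom": {"pos": "N", "number": "pl", "case": "nom"},
    "N-sg-acc": {"pos": "N", "number": "sg", "case": "acc"},
    "V-sg-3": {"pos": "V", "number": "sg", "person": "3"},
}


def reduce_msd(features, kept):
    """Project one MSD onto the kept attributes, yielding a C-tag string."""
    return "|".join(f"{k}={features[k]}" for k in kept if k in features)


def build_ctagset(msds, kept):
    """Group MSD labels by the C-tag they collapse to."""
    ctag_to_msds = defaultdict(set)
    for label, features in msds.items():
        ctag_to_msds[reduce_msd(features, kept)].add(label)
    return dict(ctag_to_msds)


# Assumed choice for illustration: keep part of speech and case, drop the rest.
ctagset = build_ctagset(MSDS, kept=("pos", "case"))
for ctag, labels in sorted(ctagset.items()):
    print(ctag, sorted(labels))
```

In this toy setting four MSDs collapse to three C-tags. A C-tag that covers more than one MSD marks a distinction the tagger no longer makes and that would have to be restored afterwards, which is why the choice of retained features needs careful evaluation.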