Title |
The Hungarian National Corpus |
Authors |
Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences) |
Session |
WP1: Corpora & Corpus Tools |
Abstract |
The paper reports on the development of the Hungarian National Corpus (HNC), which was completed at the end of 2001 after four years of work. The HNC is designed as a balanced reference corpus of current written Hungarian, comprising 150 million words. The paper first discusses basic design issues concerning the composition of the corpus, where the HNC adopts a fairly pragmatic approach that focuses on five major text types. The second half of the paper describes the annotation and tagging system used. |
Keywords |
Corpus annotation, Tagging, Disambiguation, Representative, Tiered tagging |
Full Paper |