Title

The Hungarian National Corpus

Authors

Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)

Session

WP1: Corpora & Corpus Tools

Abstract

The paper reports on the development of the Hungarian National Corpus (HNC), which was completed at the end of 2001 after four years of work. The HNC is designed as a balanced reference corpus of current written Hungarian, comprising 150 million words. The paper first discusses basic design issues concerning the composition of the corpus. The HNC adopts a fairly pragmatic approach, focusing on five major text types. The second half of the paper describes the annotation and tagging system used.

Keywords

Corpus annotation, Tagging, Disambiguation, Representative, Tiered tagging

Full Paper

217.pdf