LREC 2014 Proceedings

Summary of the paper

Title	Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Authors	Veronika Vincze, Viktor Varga, Katalin Ilona Simkó, János Zsibrita, Ágoston Nagy, Richárd Farkas and János Csirik
Abstract	The Szeged Corpus is the largest manually annotated database containing the possible morphological analyses and lemmas for each word form. In this work, we present its latest version, Szeged Corpus 2.5, which employs the new harmonized morphological coding system of Hungarian and in which the majority of misspelled words have been corrected and tagged with the proper morphological code. New morphological codes are introduced for participles, causative, modal and frequentative verbs, adverbial pronouns and punctuation marks; moreover, the distinction between common and proper nouns is eliminated. We also report some statistical data on the frequency of the new morphological codes. The new version of the corpus made it possible to train magyarlanc, a data-driven POS tagger for Hungarian, on a dataset with the new harmonized codes. According to the results, magyarlanc achieves state-of-the-art accuracy on version 2.5 as well.
Topics	Morphology, Part-of-Speech Tagging
Full paper	Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus
Bibtex	@InProceedings{VINCZE14.262,
  author = {Veronika Vincze and Viktor Varga and Katalin Ilona Simkó and János Zsibrita and Ágoston Nagy and Richárd Farkas and János Csirik},
  title = {Szeged Corpus 2.5: Morphological Modifications in a Manually POS-tagged Hungarian Corpus},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}