LREC 2012 Proceedings

Summary of the paper

Title	ROMBAC: The Romanian Balanced Annotated Corpus
Authors	Radu Ion, Elena Irimia, Dan Ștefănescu and Dan Tufiș
Abstract	This article describes the collection, processing and validation of a large balanced corpus for Romanian, and briefly reviews its annotation types and structure. The corpus was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification conforms to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.
Topics	Corpus (creation, annotation, etc.), Part of speech tagging, Standards for LRs
Full paper	ROMBAC: The Romanian Balanced Annotated Corpus
Bibtex	@InProceedings{ION12.218, author = {Radu Ion and Elena Irimia and Dan Ștefănescu and Dan Tufiș}, title = {ROMBAC: The Romanian Balanced Annotated Corpus}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }