LREC 2012 Proceedings

Summary of the paper

Title	Building a multilingual parallel corpus for human users
Authors	Alexandr Rosen and Martin Vavřín
Abstract	We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora, we give an overview of the data collection procedure, which covers text selection criteria, data format, conversion, alignment, lemmatization and tagging. Finally, we show a sample query using the web-based search interface and discuss the challenges and prospects of the project.
Topics	Corpus (creation, annotation, etc.), Multilinguality, Part of speech tagging
Bibtex	@InProceedings{ROSEN12.200, author = {Alexandr Rosen and Martin Vavřín}, title = {Building a multilingual parallel corpus for human users}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }