Title |
MultiUN v2: UN Documents with Multilingual Alignments |
Authors |
Yu Chen and Andreas Eisele |
Abstract |
MultiUN is a multilingual parallel corpus extracted from the official documents of the United Nations. It is available in the six official languages of the UN, and a small portion of it is also available in German. This paper presents a major update to the first public version of the corpus, released in 2010. Version 2 consists of over 513,091 documents, more than 9% of which are new documents retrieved from the United Nations official document system. We applied several modifications to the corpus preparation method. In this paper, we describe the methods we used for processing the UN documents and aligning the sentences. The most significant improvement over the previous release is the newly added multilingual sentence alignment information. The alignment information is encoded together with the text in XML rather than in additional files. Our representation of the sentence alignment allows quick construction of aligned texts parallel in an arbitrary number of languages, which is essential for building machine translation systems. |
Topics |
Corpus (creation, annotation, etc.), Multilinguality, Machine Translation, Speech-to-Speech Translation |
Full paper |
MultiUN v2: UN Documents with Multilingual Alignments |
Bibtex |
@InProceedings{CHEN12.641,
  author = {Yu Chen and Andreas Eisele},
  title = {MultiUN v2: UN Documents with Multilingual Alignments},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |