LREC 2010 Proceedings

Summary of the paper

Title	MultiUN: A Multilingual Corpus from United Nation Documents
Authors	Andreas Eisele and Yu Chen
Abstract	This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.
Topics	Machine Translation, Speech-to-Speech Translation, Corpus (creation, annotation, etc.), Multilinguality
Bibtex	@InProceedings{EISELE10.686,
  author = {Andreas Eisele and Yu Chen},
  title = {MultiUN: A Multilingual Corpus from United Nation Documents},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}