Title |
Data Issues in English-to-Hindi Machine Translation |
Authors |
Ondřej Bojar, Pavel Straňák and Daniel Zeman |
Abstract |
Statistical machine translation into morphologically richer languages is a challenging task, all the more so if the source and target languages differ in word order. Current state-of-the-art MT systems thus deliver mediocre results. Adding more parallel data often helps improve the results; when it does not, the cause may be one of various problems such as differing domains, bad alignment, or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective. We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. We demonstrate various problems encountered in the data and describe automatic methods of data cleaning and normalization. We also show that the contents of two independently distributed data sets can unexpectedly overlap, which negatively affects translation quality. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts in the data even for a developer who does not speak the target language. |
Topics |
Machine Translation, Speech-to-Speech Translation, Evaluation methodologies, Corpus (creation, annotation, etc.) |
Full paper |
Data Issues in English-to-Hindi Machine Translation |
Slides |
- |
Bibtex |
@InProceedings{BOJAR10.756,
  author = {Ondřej Bojar and Pavel Straňák and Daniel Zeman},
  title = {Data Issues in English-to-Hindi Machine Translation},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |