Title |
An English-Portuguese parallel corpus of questions: translation guidelines and application in SMT |
Authors |
Ângela Costa, Tiago Luís, Joana Ribeiro, Ana Cristina Mendes and Luísa Coheur |
Abstract |
Statistical Machine Translation depends on large training corpora. Despite the availability of several parallel corpora, these are typically composed of declarative sentences, which may not be appropriate when the goal is to translate other types of sentences, e.g., interrogatives. There have been efforts to create corpora of questions, especially in the context of the evaluation of Question-Answering systems. One of those corpora is the UIUC dataset, composed of nearly 6,000 questions and widely used in the task of Question Classification. In this work, we make available the Portuguese version of the UIUC dataset, which we manually translated, as well as the translation guidelines. We show the impact of this corpus on the performance of a state-of-the-art SMT system when translating questions. Finally, we present a taxonomy of translation errors, according to which we analyze the output of the automatic translation before and after using the corpus as training data. |
Topics |
Corpus (creation, annotation, etc.), Machine Translation, SpeechToSpeech Translation, Question Answering |
Full paper |
An English-Portuguese parallel corpus of questions: translation guidelines and application in SMT |
Bibtex |
@InProceedings{COSTA12.356,
  author = {Ângela Costa and Tiago Luís and Joana Ribeiro and Ana Cristina Mendes and Luísa Coheur},
  title = {An English-Portuguese parallel corpus of questions: translation guidelines and application in SMT},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |