Title |
A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora |
Authors |
Mahdi Khademian, Kaveh Taghipour, Saab Mansour and Shahram Khadivi |
Abstract |
Accurate translation with statistical machine translation systems, especially for multi-domain documents, requires ever larger amounts of bilingual text, and this need becomes more critical when training such systems for language pairs with scarce training data. In recent years, research has explored a new source of parallel text: documents that are not necessarily parallel but are comparable. Because these methods search for possible translation equivalences greedily, they cannot consider all possible parallel texts in comparable documents. This paper investigates a different approach that considers the relationships between all words of two comparable documents and works fairly well even in the worst case of comparability. We represent each document pair as a matrix and then transform it to a new space to find parallel fragments. Evaluations show that the system successfully extracts useful fragment pairs. |
Topics |
Corpus (creation, annotation, etc.), Text mining, Tools, systems, applications |
Bibtex |
@InProceedings{KHADEMIAN12.892,
  author    = {Mahdi Khademian and Kaveh Taghipour and Saab Mansour and Shahram Khadivi},
  title     = {A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year      = {2012},
  month     = {may},
  date      = {23-25},
  address   = {Istanbul, Turkey},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-7-7},
  language  = {english}
} |