LREC 2010 Proceedings

Summary of the paper

Title	Using Comparable Corpora to Adapt a Translation Model to Domains
Authors	Hiroyuki Kaji, Takashi Tsunakawa and Daisuke Okada
Abstract	Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.
Topics	Machine Translation, Speech-to-Speech Translation, Statistical and machine learning methods, Word Sense Disambiguation
Full paper	Using Comparable Corpora to Adapt a Translation Model to Domains
Slides	Using Comparable Corpora to Adapt a Translation Model to Domains
Bibtex	@InProceedings{KAJI10.443,
  author = {Hiroyuki Kaji and Takashi Tsunakawa and Daisuke Okada},
  title = {Using Comparable Corpora to Adapt a Translation Model to Domains},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}