LREC 2010 Proceedings

Summary of the paper

Title	Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Authors	Yulia Tsvetkov and Shuly Wintner
Abstract	Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
Topics	Corpus (creation, annotation, etc.), Multilinguality, Machine Translation, Speech-to-Speech Translation
Full paper	Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Bibtex	@InProceedings{TSVETKOV10.40,
  author = {Yulia Tsvetkov and Shuly Wintner},
  title = {Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}