LREC 2016 Proceedings

Summary of the paper

Title	Exploiting a Large Strongly Comparable Corpus
Authors	Thierry Etchegoyhen, Andoni Azpeitia and Naiara Pérez
Abstract	This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community, as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for the improvement of data-driven machine translation systems for this language pair. Competing approaches were explored for the alignment of comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.
Topics	Corpus (Creation, Annotation, etc.), Information Extraction, Information Retrieval, Machine Translation, Speech-to-Speech Translation
Full paper	Exploiting a Large Strongly Comparable Corpus
Bibtex	@InProceedings{ETCHEGOYHEN16.394,
  author = {Thierry Etchegoyhen and Andoni Azpeitia and Naiara Pérez},
  title = {Exploiting a Large Strongly Comparable Corpus},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}