Summary of the paper

Title Extracting Parallel Sentences from Comparable Corpora with STACC Variants
Authors Andoni Azpeitia, Thierry Etchegoyhen and Eva Martínez Garcia
Abstract This article describes our submissions to the BUCC 2018 shared task on parallel sentence extraction from comparable corpora. Our approach is based on variants of the STACC method, which computes similarity on expanded lexical sets via Jaccard similarity. We apply the weighted variant of the method to all four language pairs of the task, demonstrating the efficiency and portability of the approach. Additionally, we introduce a variant which further penalizes mismatches in terms of named entities, improving over the already strong weighted variant baseline in most cases. Our approach reached the highest results in all scenarios, with scores over 80% in terms of F1-measure and 90% in precision.
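As a rough illustration of the idea named in the abstract (Jaccard similarity over expanded lexical sets), the sketch below may help. It is not the authors' implementation: the `expand` helper, the toy lexicon, and the example sentences are all hypothetical, and the weighted and named-entity variants described in the paper are not modeled here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def expand(tokens, lexicon):
    """Hypothetical lexical expansion: map each token to its candidate
    translations, keeping the token itself when the lexicon has no entry."""
    expanded = set()
    for tok in tokens:
        expanded.update(lexicon.get(tok, {tok}))
    return expanded


# Toy data, assumed purely for illustration.
toy_lexicon = {"casa": {"house", "home"}, "roja": {"red"}}
source = ["la", "casa", "roja"]
target = ["the", "red", "house"]

# Similarity between the expanded source set and the target set.
score = jaccard(expand(source, toy_lexicon), target)
print(score)
```

In this toy case the expanded source set shares two tokens with the target out of five distinct tokens overall, so a sentence pair with more translated overlap would score higher.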
Bibtex @InProceedings{AZPEITIA18.6,
  author = {Andoni Azpeitia and Thierry Etchegoyhen and Eva Martínez Garcia},
  title = {Extracting Parallel Sentences from Comparable Corpora with STACC Variants},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Reinhard Rapp and Pierre Zweigenbaum and Serge Sharoff},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-07-8},
  language = {english}
  }