Title |
Constructing a Chinese―Japanese Parallel Corpus from Wikipedia |
Authors |
Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi |
Abstract |
Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies for both accuracy in parallel sentence extraction and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/~chu/resource/wiki_zh_ja.tgz. |
Topics |
Corpus (Creation, Annotation, etc.), Multilinguality |
Full paper |
Constructing a Chinese―Japanese Parallel Corpus from Wikipedia |
Bibtex |
@InProceedings{CHU14.21,
  author    = {Chenhui Chu and Toshiaki Nakazawa and Sadao Kurohashi},
  title     = {Constructing a Chinese―Japanese Parallel Corpus from Wikipedia},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year      = {2014},
  month     = {may},
  date      = {26-31},
  address   = {Reykjavik, Iceland},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-8-4},
  language  = {english}
} |