Title

A Comparison of Two Variant Corpora: The Same Content with Different Sources

Author(s)

Kyonghee Paik, Kiyonori Ohtake, Kazuhide Yamamoto

ATR Spoken Language Translation Research Laboratories

Session

P19-SW

Abstract

To investigate the effect of the source language on translations, we compare two variants of a Korean translation corpus. The first variant consists of Korean translations of 162,308 Japanese sentences from the ATR BTEC (Basic Travel Expression Corpus). The second variant was made by translating the English translations of those Japanese sentences into Korean. We show that the source language text has a large influence on the target text: even after normalizing orthographic differences, fewer than 8.3% of the sentences in the two variants were identical. We describe in general terms which phenomena differ and then discuss how our analysis can be used in natural language processing.

Keyword(s)

source language, variants of a corpus, linguistic similarity and difference, similarity score, natural language processing

Language(s)

Korean, Japanese, English

Full Paper

424.pdf