
Toward a broad-coverage bilingual corpus for speech translation of travel conversations in the real world


Toshiyuki Takezawa (ATR Spoken Language Translation Research Laboratories)

Eiichiro Sumita (ATR Spoken Language Translation Research Laboratories)

Fumiaki Sugaya (ATR Spoken Language Translation Research Laboratories)

Hirofumi Yamamoto (ATR Spoken Language Translation Research Laboratories)

Seiichi Yamamoto (ATR Spoken Language Translation Research Laboratories)


SO2: Speech To Speech Translation


At ATR Spoken Language Translation Research Laboratories, we are building a broad-coverage bilingual corpus to study corpus-based speech translation technologies for the real world. There are three important points to consider in designing and constructing a corpus for future speech translation research. The first is to have a variety of speech samples, with a wide range of pronunciations and speakers. The second is to have data for a variety of situations. The third is to have a variety of expressions. This paper reports our trials and discusses the methodology. First, we introduce a bilingual travel conversation (TC) corpus of spoken languages and a broad-coverage bilingual basic expression (BE) corpus. TC and BE are designed to be complementary. TC is a collection of transcriptions of bilingual spoken dialogues, while BE is a collection of Japanese sentences and their English translations. Whereas TC covers a small domain, BE covers a wide variety of domains. We compare the characteristics of vocabulary and expressions between these two corpora and suggest that we need a much greater variety of expressions. One promising approach might be to collect paraphrases representing the various expressions that many people generate for similar concepts.


Bilingual corpus, Spoken language, Speech translation, Design and construction methodologies, Paraphrase

