Title Phonetically Distributed Continuous Speech Corpus for Thai Language
Authors Chai Wutiwiwatchai (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)

Patcharika Cotsomrong (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)

Sinaporn Suebvisai (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)

Supphanat Kanongphara (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)

Session SP2: Speech Varieties And Multilingual ASR
Abstract

This paper presents work on a phonetically balanced (PB) sentence set and a phonetically distributed (PD) sentence set, which form part of the text prompts for speech recording in a Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language. First, a protocol for Thai phonetic transcription and some essential rules for phonetic correction after the grapheme-to-phoneme (G2P) process are described. PB and PD sentences are then selected by an iterative procedure, which avoids the tedious work of manually correcting the phones of all initial sentences. A standard text corpus, ORCHID, was chosen as the initial text. Several attributes of both the PB and PD sets are analyzed and reported, including the numbers of words, syllables, monophones, and biphones, and the phone distribution. Finally, the selected PB set is partially compared with the American English TIMIT PB set (MIT-450) and the Japanese ATR 503 PB set.

Keywords Phonetically distributed (PD) set, Phonetically balanced (PB) set, Large vocabulary continuous speech recognition (LVCSR) corpus
Full Paper 342.pdf