Title | Phonetically Distributed Continuous Speech Corpus for Thai Language |
Authors | Chai Wutiwiwatchai (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)
Patcharika Cotsomrong (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)
Sinaporn Suebvisai (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand)
Supphanat Kanongphara (Information Research and Development Unit, National Electronics and Computer Technology Center, 112 Thailand Science Park, Paholyothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand) |
Session | SP2: Speech Varieties And Multilingual ASR |
Abstract |
This paper presents work on phonetically balanced (PB) and phonetically distributed (PD) sentence sets, which form part of the text prompts for speech recording in a Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language. First, a protocol for Thai phonetic transcription and some essential rules for phonetic correction after the grapheme-to-phoneme (G2P) process are described. An iterative procedure for PB and PD sentence selection is used to avoid the tedious work of manually correcting the phones of all initial sentences. A standard text corpus, ORCHID, was chosen as the initial text. Analyses of several attributes in both the PB and PD sets are reported, including the number of words, syllables, monophones and biphones, and the phone distribution. Finally, the selected PB set is partially compared with the American English TIMIT PB set (MIT-450) and the Japanese ATR 503 PB set. |
Keywords | Phonetically distributed (PD) set, Phonetically balanced (PB) set, Large vocabulary continuous speech recognition (LVCSR) corpus |
Full Paper | 342.pdf |