Title | Open Collaborative Development of the Thai Language Resources for Natural Language Processing |
Author(s) |
Thatsanee Charoenporn (1), Virach Sornlertlamvanich (1), Sawit Kasuriya (2), Chatchawarn Hansakunbuntheung (2), Hitoshi Isahara (1)
(1) Thai Computational Linguistics Laboratory, Communications Research Laboratory, Thailand; (2) National Electronics and Computer Technology Center, Thailand |
Session | P13-W |
Abstract | Language resources are recognized as an essential component of linguistic infrastructure and a starting point for Natural Language Processing systems and applications. In this paper, we describe what has been achieved in developing and using Thai language resources on an open collaboration platform, through cooperation between research institutes. The resources include both text and speech. Text resources are divided into a lexicon database and an annotated corpus. We have been developing a corpus-based Thai-English lexicon database (LEXiTRON) since 1994. It originated from a dictionary designed for use in developing a machine translation system. Since then, the Thai POS has been designed and evaluated in several applications (word segmentation, machine translation, grapheme-to-phoneme conversion, etc.). Extending the lexicon database, a POS-tagged corpus (ORCHID) and speech corpora for both synthesis and recognition have been developed and serve as an important part of research and development in NLP and HLT. These language resources are available for academic experimentation. |
Keyword(s) | Corpus, written-spoken corpus, language resource, text corpus, speech corpus, tagged corpus |
Language(s) | Thai |
Full Paper | 434.pdf |