Title	Open Collaborative Development of the Thai Language Resources for Natural Language Processing
Author(s)	Thatsanee Charoenporn (1), Virach Sornlertlamvanich (1), Sawit Kasuriya (2), Chatchawarn Hansakunbuntheung (2), Hitoshi Isahara (1); (1) Thai Computational Linguistics Laboratory, Communications Research Laboratory, Thailand; (2) National Electronics and Computer Technology Center, Thailand
Session	P13-W
Abstract	Language resources are recognized as an essential component of linguistic infrastructure and a starting point for Natural Language Processing systems and applications. In this paper, we describe what has been achieved in developing and using Thai language resources that grew out of an open collaboration platform shared between research institutes. The resources cover both text and speech. Text resources are divided into a lexicon database and an annotated corpus. We have been developing a corpus-based Thai-English lexicon database (LEXiTRON) since 1994. It originated from a dictionary designed for use in developing a machine translation system. Since then, a Thai part-of-speech (POS) tag set has been designed and evaluated in several applications (word segmentation, machine translation, grapheme-to-phoneme conversion, etc.). Extending the lexicon database, a POS-tagged corpus (ORCHID) and speech corpora for both synthesis and recognition have been developed, and they serve as an important part of research and development in NLP and HLT. These language resources are available for academic experimentation.
Keyword(s)	Corpus, written-spoken corpus, language resource, text corpus, speech corpus, tagged corpus
Language(s)	Thai
Full Paper	434.pdf