LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | Collocations as Word Co-ocurrence Restriction Data - An Application to Japanese Word Processor - |
Authors | Shudo Kosho (Faculty of Engineering, Fukuoka University, Fukuoka 814-0180, Japan, shudo@tl.fukuoka-u.ac.jp) Takahashi Masahito (Faculty of Engineering, Fukuoka University, Fukuoka 814-0180, Japan, takahasi@helio.tl.fukuoka-u.ac.jp) Koyama Yasuo (aisoft co, 2-1-27, Chuou, Matsumoto, Nagano 390-0811, Japan, koyama@aisoft.co.jp) Yoshimura Kenji (Faculty of Engineering, Fukuoka University, Fukuoka 814-0180, Japan, yosimura@tl.fukuoka-u.ac.jp) |
Keywords | Collocation, Idiom, Kana-to-Kanji Conversion |
Session | Session WP9 - Applications using Written Language Resources |
Full Paper | 2.ps, 2.pdf |
Abstract | Collocations, combinations of specific words, are quite useful linguistic resources for NLP in general. The purpose of this paper is to show their usefulness by exemplifying an application to the Kanji character decision process in Japanese word processors. Unlike recent attempts at automatic extraction, our collocations were collected manually through many years of intensive corpus investigation. Our collection procedure consists of (1) finding a proper combination of words in a corpus and (2) recollecting similar combinations of words prompted by it. This procedure, which depends on human judgment and the enrichment of data by association, is effective for remedying the data sparseness problem, although some arbitrariness of human judgment is inevitable. Approximately 72,400 collocations were used as word co-occurrence restriction data for deciding Kanji characters in Japanese word processors. Experiments have shown that the collocation data yield Kana-to-Kanji conversion accuracy 8.9% higher than that of a system which uses no collocation data and 7.0% higher than that of a commercial word processor of average performance. |