Title | The Lancaster Corpus of Mandarin Chinese: A Corpus for Monolingual and Contrastive Language Study |
Author(s) | Anthony McEnery, Zhonghua Xiao; Department of Linguistics, Lancaster University, Lancaster, LA1 4YT, UK {a.mcenery, z.xiao}@lancaster.ac.uk |
Session | P12-W |
Abstract | This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese counterpart to the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. It comprises five hundred 2,000-word samples drawn from fifteen text categories of written Chinese published in Mainland China around 1991. LCMC is XML-compliant and conforms to the Corpus Encoding Standard (CES); each document contains a corpus header, which gives general information about the corpus, and a body of text. The corpus is segmented and POS-tagged, with a tagging precision of over 98%. It is a useful resource for research into modern Chinese and for cross-linguistic contrastive study of English and Chinese. |
Keyword(s) | Lancaster Corpus, Corpus, Monolingual, Language Study |
Language(s) | N/A |
Full Paper | 231.pdf |