SUMMARY: Session P22-W
Title | Open Source Corpus Analysis Tools for Malay |
---|---|
Authors | T. Baldwin, S. Awab |
Abstract | Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K-word sample of Malay text. |
Keywords | sentence tokeniser, lemmatiser, Malay |
Full paper | Open Source Corpus Analysis Tools for Malay |