Indonesian, a Malayo-Polynesian language, is spoken by a large population, yet language resources and processing tools for it remain quite limited. Part-of-speech (POS) tagging assigns a POS to each word according to its distribution and function in context, and it provides valuable information for most natural language processing tasks. This paper presents our design of an Indonesian POS tagset of 29 tags and the construction of a large Indonesian POS corpus of over 355,000 tokens. During design and annotation, we make judgments mainly from a typological perspective, following the specifications of Universal Dependencies, while still accounting for language-specific phenomena. In addition, we train several state-of-the-art sequence labeling models on the proposed corpus for automatic POS tagging; the experimental results are favorable, with accuracies above 94%.
@InProceedings{FU18.3,
  author    = {Sihui Fu and Nankai Lin},
  title     = {Towards Indonesian Part-of-Speech Tagging: Corpus and Models},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Erhong Yang and Le Sun},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-29-0},
  language  = {english}
}