Summary of the paper

Title IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus
Authors Septina Dian Larasati
Abstract This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to ensure its quality. We also apply language-specific text processing such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: ‘plain’, stored in text format, and ‘morphologically enriched’, stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.
Topics Corpus (creation, annotation, etc.), Morphology, Tools, systems, applications
Full paper IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus
Bibtex @InProceedings{LARASATI12.644,
  author = {Septina Dian Larasati},
  title = {IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }