Title |
Resource Creation for Training and Testing of Transliteration Systems for Indian Languages |
Authors |
Sowmya V. B., Monojit Choudhury, Kalika Bali, Tirthankar Dasgupta and Anupam Basu |
Abstract |
Machine transliteration is used in a number of NLP applications, ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, such as Baraha, Akshara and Quillpad, use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparison of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also for the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman. |
Topics |
Corpus (creation, annotation, etc.), Other |
Full paper |
Resource Creation for Training and Testing of Transliteration Systems for Indian Languages |
Slides |
Resource Creation for Training and Testing of Transliteration Systems for Indian Languages |
Bibtex |
@InProceedings{VB10.182,
  author = {Sowmya V. B. and Monojit Choudhury and Kalika Bali and Tirthankar Dasgupta and Anupam Basu},
  title = {Resource Creation for Training and Testing of Transliteration Systems for Indian Languages},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |