W11 2018 Proceedings

Summary of the paper

Title	Automatic Identification of Closely-related Indian Languages: Resources and Experiments
Authors	Ritesh Kumar, Bornini Lahiri, Deepak Alok, Atul Kr. Ojha, Mayank Jain, Abdul Basit and Yogesh Dawar
Abstract	In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of ‘closeness’ of these languages.
Topics	Closely-Related Language, Indo-Aryan, Language Identification
Full paper	Automatic Identification of Closely-related Indian Languages: Resources and Experiments
Bibtex	@InProceedings{KUMAR18.26, author = {Ritesh Kumar and Bornini Lahiri and Deepak Alok and Atul Kr. Ojha and Mayank Jain and Abdul Basit and Yogesh Dawar}, title = {Automatic Identification of Closely-related Indian Languages: Resources and Experiments}, booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)}, year = {2018}, month = {may}, date = {7-12}, location = {Miyazaki, Japan}, editor = {Girish Nath Jha and Kalika Bali and Sobha L and Atul Kr. Ojha}, publisher = {European Language Resources Association (ELRA)}, address = {Paris, France}, isbn = {979-10-95546-09-2}, language = {english} }