Title |
The Problems of Language Identification within Hugely Multilingual Data Sets |
Authors |
Fei Xia, Carrie Lewis and William D. Lewis |
Abstract |
As the data for more and more languages is finding its way into digital form, with an increasing amount of this data being posted to the Web, it has become possible to collect language data from the Web and create large multilingual resources, covering hundreds or even thousands of languages. ODIN, the Online Database of INterlinear text (Lewis, 2006), is such a resource. It currently consists of nearly 200,000 data points for over 1,000 languages, the data for which was harvested from linguistic documents on the Web. We identify a number of issues with language identification for such broad-coverage resources including the lack of training data, ambiguous language names, incomplete language code sets, and incorrect uses of language names and codes. After providing a short overview of existing language code sets maintained by the linguistic community, we discuss what linguists and the linguistic community can do to make the process of language identification easier. |
Topics |
Corpus (creation, annotation, etc.), Multilinguality, Endangered languages |
Full paper |
The Problems of Language Identification within Hugely Multilingual Data Sets |
Slides |
The Problems of Language Identification within Hugely Multilingual Data Sets |
Bibtex |
@InProceedings{XIA10.921,
  author = {Fei Xia and Carrie Lewis and William D. Lewis},
  title = {The Problems of Language Identification within Hugely Multilingual Data Sets},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |