Title |
Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance |
Authors |
Siim Orasmaa, Reina Käärik, Jaak Vilo and Tiit Hennoste |
Abstract |
An important feature of spoken language corpora is the existence of different spelling variants of words in transcriptions. This poses a problem for linguists who work with large spoken corpora: how can all variants of a word be found without annotating them manually? Our work describes a search engine that finds different spelling variants (true positives) in a corpus of spoken language and efficiently reduces the number of false positives returned during the search. The search engine uses a generalized variant of the edit distance algorithm that allows text-specific string-to-string transformations to be defined in addition to the default edit operations of edit distance. We have extended the algorithm with the capability to block transformations in specific substrings of search words: the user can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu, which consists of about 2000 dialogues and texts, about 1.4 million running text units in total. |
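The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes unit cost for the default insert, delete and substitute operations, a hypothetical rule format `{(src, dst): cost}` for the text-specific transformations, and blocked regions given as a set of 0-based indices of the search word. The function name, the rule `("igh", "i")` and the handling of insertions at region boundaries are assumptions made for illustration only.

```python
from math import inf


def generalized_edit_distance(query, candidate, transformations=None, blocked=None):
    """Weighted edit distance from `query` to `candidate`.

    transformations: dict {(src, dst): cost} of extra string-to-string rules
                     (hypothetical format, not taken from the paper).
    blocked: set of 0-based indices of `query` that must match exactly.
    """
    transformations = transformations or {}
    blocked = blocked or set()
    n, m = len(query), len(candidate)
    # d[i][j] = cheapest known way to turn query[:i] into candidate[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0

    def relax(i, j, cost):
        if cost < d[i][j]:
            d[i][j] = cost

    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue  # state not reachable
            if i < n and j < m and query[i] == candidate[j]:
                relax(i + 1, j + 1, cur)  # exact match is free, even when blocked
            if i < n and i not in blocked:
                relax(i + 1, j, cur + 1.0)  # delete query[i]
                if j < m:
                    relax(i + 1, j + 1, cur + 1.0)  # substitute query[i]
            # Assumption: insertion is forbidden only strictly inside a blocked run.
            inside_block = i in blocked and (i - 1) in blocked
            if j < m and not inside_block:
                relax(i, j + 1, cur + 1.0)  # insert candidate[j]
            for (src, dst), cost in transformations.items():
                if not src and not dst:
                    continue  # an empty rule would not advance the alignment
                touches_block = any(k in blocked for k in range(i, i + len(src)))
                if (query.startswith(src, i) and candidate.startswith(dst, j)
                        and not touches_block):
                    relax(i + len(src), j + len(dst), cur + cost)
    return d[n][m]


rules = {("igh", "i"): 0.5}  # illustrative rule, not from the paper
print(generalized_edit_distance("night", "nite"))
print(generalized_edit_distance("night", "nite", rules))
print(generalized_edit_distance("night", "nite", rules, blocked={1, 2, 3}))
```

In this sketch a cheap transformation lowers the distance to a plausible spelling variant, while a blocked region makes any candidate that differs inside that region unreachable (infinite distance), which is one way the number of false positives could be reduced.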
Topics |
Corpus (creation, annotation, etc.); Tools, systems, applications; Lexicon, lexical database |
Full paper |
Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance |
Slides |
- |
Bibtex |
@InProceedings{ORASMAA10.600,
  author = {Siim Orasmaa and Reina Käärik and Jaak Vilo and Tiit Hennoste},
  title = {Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |