Summary of the paper

Title Creating Lithuanian and Latvian Speech Corpora from Inaccurately Annotated Web Data
Authors Askars Salimbajevs
Abstract This paper describes the method used to produce additional acoustic model training data for the less-resourced languages Lithuanian and Latvian. The method uses existing baseline speech recognition systems for Latvian and Lithuanian to align audio data from the Web with imprecise, non-normalised transcripts. From 690 hours of Web data (300h for Latvian, 390h for Lithuanian), we created an additional 378 hours of training data (186h for Latvian and 192h for Lithuanian). Combining this additional data with the baseline training data significantly improved the word error rate for Lithuanian, from 40% to 23%. The word error rate for the Latvian system improved from 19% to 17%.
Topics Speech Resource/Database, Corpus (Creation, Annotation, Etc.), Speech Recognition/Understanding
Bibtex @InProceedings{SALIMBAJEVS18.258,
  author = {Askars Salimbajevs},
  title = "{Creating Lithuanian and Latvian Speech Corpora from Inaccurately Annotated Web Data}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }