Summary of the paper

Title A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Authors Pierre Godard, Gilles Adda, Martine Adda-Decker, Juan Benjumea, Laurent Besacier, Jamison Cooper-Leavitt, Guy-Noel Kouarata, Lori Lamel, Hélène Maynard, Markus Mueller, Annie Rialland, Sebastian Stueker, François Yvon and Marcely Zanon Boito
Abstract Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources, and some even lack a stable orthography. Building systems under these almost zero-resource conditions is promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered, unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We detail how the data was collected, cleaned, and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Topics Speech Resource/Database, Endangered Languages, Corpus (Creation, Annotation, Etc.)
Bibtex @InProceedings{GODARD18.694,
  author = {Pierre Godard and Gilles Adda and Martine Adda-Decker and Juan Benjumea and Laurent Besacier and Jamison Cooper-Leavitt and Guy-Noel Kouarata and Lori Lamel and Hélène Maynard and Markus Mueller and Annie Rialland and Sebastian Stueker and François Yvon and Marcely Zanon Boito},
  title = "{A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }