Title |
A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition |
Authors |
Abir Masmoudi, Mariem Ellouze Khmekhem, Yannick Esteve, Lamia Hadrich Belguith and Nizar Habash |
Abstract |
In this paper, we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), comprises audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper for automatically generating a phonetic dictionary is rule-based; accordingly, we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate our pronunciation dictionary on two types of corpora. The word-level error rate of the grapheme-to-phoneme mapping is around 9%. |
Topics |
Speech Recognition/Understanding, Corpus (Creation, Annotation, etc.) |
Full paper |
A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition |
Bibtex |
@InProceedings{MASMOUDI14.454,
  author    = {Abir Masmoudi and Mariem Ellouze Khmekhem and Yannick Esteve and Lamia Hadrich Belguith and Nizar Habash},
  title     = {A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year      = {2014},
  month     = {may},
  date      = {26-31},
  address   = {Reykjavik, Iceland},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-8-4},
  language  = {english}
} |