Title |
New language resources for the Pashto language |
Authors |
Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane |
Abstract |
This paper reports on the development of new language resources for Pashto, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. First, a monolingual text corpus of 100 million words is produced. Second, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation. |
Topics |
Corpus (creation, annotation, etc.), Machine Translation, Speech-to-Speech Translation, Speech resource/database |
Full paper |
New language resources for the Pashto language |
Bibtex |
@InProceedings{MOSTEFA12.824,
  author    = {Djamel Mostefa and Khalid Choukri and Sylvie Brunessaux and Karim Boudahmane},
  title     = {New language resources for the Pashto language},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year      = {2012},
  month     = {may},
  date      = {23-25},
  address   = {Istanbul, Turkey},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-7-7},
  language  = {english}
} |