Title | Collection and Evaluation of Broadcast News Data for Arabic |
Author(s) |
Mohamed Afify (1), Ossama Emam (2)
(1) Department of Information Technology, Faculty of Information and Computer, Cairo University; (2) Human Language Technologies Group, IBM Egypt |
Session | O23-SE |
Abstract | This paper presents a general methodology for acquiring broadcast news data from the web and segmenting it automatically. We show that, starting from a relatively small corpus of about 10 hours, it is possible to automatically segment about 30 hours of data. This step is important because manual segmentation of broadcast news data is generally tedious and time-consuming. In addition to the data collection proposal, we describe the development of an initial recognition system. We also present an automatic procedure for creating vowelizations for Arabic words. This is important because most available Arabic transcriptions lack vowelization, which is crucial for creating phonetic transcriptions. The initial system achieves an error rate of 36%. |
Keyword(s) | Broadcast News, Speech Recognition |
Language(s) | Arabic |
Full Paper | 315.pdf |