Title |
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project |
Authors |
Yi Liu, Pascale Fung, Yongsheng Yang, Denise DiPersio, Meghan Glenn, Stephanie Strassel and Christopher Cieri |
Abstract |
In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST’s acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance. |
Topics |
Speech resource/database, Corpus (creation, annotation, etc.) |
Full paper |
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project |
Slides |
- |
Bibtex |
@InProceedings{LIU10.664,
  author = {Yi Liu and Pascale Fung and Yongsheng Yang and Denise DiPersio and Meghan Glenn and Stephanie Strassel and Christopher Cieri},
  title = {A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |