Title	Linguistic Resources for Effective, Affordable, Reusable Speech-to-Text
Author(s)	Stephanie Strassel (University of Pennsylvania, Linguistic Data Consortium)
Session	O4-S
Abstract	This paper describes ongoing efforts at the Linguistic Data Consortium (LDC) to create shared evaluation resources for improved speech-to-text (STT) technology. The DARPA EARS Program (Effective, Affordable, Reusable Speech-to-Text) focuses on enabling core STT technology to produce rich, highly accurate output in a range of languages and speaking styles. The aggressive EARS program goals motivate new approaches to corpus creation and distribution. EARS research sites require multilingual broadcast news and telephone speech, transcripts, and annotations at a much higher volume than for any previous technology program. In response to these demands, LDC has developed new corpora for training and evaluating speech-to-text systems in English, Arabic, and Chinese, and for supporting systems that distinguish speakers, identify and repair disfluencies, and punctuate text to improve readability.
Keyword(s)	spoken language resources, speech-to-text, metadata extraction, data centers
Language(s)	English, Chinese, Arabic
Full Paper	762.pdf