LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers
Authors Witt Andreas (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, witt@lili.uni-bielefeld.de, Postfach 10 01 31, 33501 Bielefeld, Germany)
Lüngen Harald (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, luengen@spectrum.uni-bielefeld.de, Postfach 10 01 31, 33501 Bielefeld, Germany)
Gibbon Dafydd (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de)
Keywords DSSSL, Morphology, Speech Corpora, Speech Lexica, Text Technology, XML
Session Session SP2 - Spoken Language Resources Issues from Construction to Validation
Full Paper 183.ps, 183.pdf
Abstract We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types).