LREC 2000 2nd International Conference on Language Resources & Evaluation
Conference Papers
Title | Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers |
Authors |
Witt Andreas (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, witt@lili.uni-bielefeld.de, Postfach 10 01 31, 33501 Bielefeld, Germany); Lüngen Harald (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, luengen@spectrum.uni-bielefeld.de, Postfach 10 01 31, 33501 Bielefeld, Germany); Gibbon Dafydd (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de) |
Keywords | DSSSL, Morphology, Speech Corpora, Speech Lexica, Text Technology, XML |
Session | Session SP2 - Spoken Language Resources Issues from Construction to Validation |
Abstract | We describe a general two-stage procedure for re-using a custom corpus for spoken language system development. The first stage is a transformation from character-based markup to XML; the second is DSSSL stylesheet-driven enhancement of the XML markup with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types). |