LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages
Authors Gibbon Dafydd (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de)
Quirino Simões Ana Paula (CSLI, Stanford University, CA 94305-4115, USA, aquirino@stanford.edu)
Matthiesen Martin (Lingsoft, Inc., Tehtaankatu 27-29 D, FIN-00150 Helsinki, Finland)
Keywords Finite State Technologies, Grapheme-Phoneme Conversion, Morphology, Morphophonology, Pronunciation, xfst
Session Session SP1 - Phonetic Issues and Speech Synthesis
Full Paper 251.ps, 251.pdf
Abstract We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, increasing speed by a factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out-of-vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. For a given stem lexicon size, a spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) than with less highly inflecting languages (e.g. English) because of the 'morphological handicap' (ratio of stems to inflected word forms), which for German is about 1:5. The problem is worse still for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be 'projected' using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need to define pronunciation variants for inflected word forms; we also propose an efficient solution to this problem.
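
The projection step described in the abstract (a morphotactic grammar for regular forms, plus table lookup for irregular forms) can be illustrated with a small sketch. This is plain Python rather than the finite state tooling the paper uses, and everything in it is an assumption made for illustration: the ending list, the irregular table, the example verbs, and the function names `project_forms` and `moov_forms` are hypothetical and are not taken from the paper.

```python
# Toy illustration only. The paradigm, the lexicon entries and the function
# names are assumptions for this sketch, not the authors' implementation.

# Assumed present-tense endings for a regular (weak) German verb stem.
WEAK_VERB_ENDINGS = ["e", "st", "t", "en"]

# Assumed lookup table for irregular forms, keyed by infinitive.
IRREGULAR_FORMS = {
    "sein": ["bin", "bist", "ist", "sind", "seid"],
}


def project_forms(infinitive):
    """Return inflected forms: table lookup first, else stem + endings."""
    if infinitive in IRREGULAR_FORMS:
        return list(IRREGULAR_FORMS[infinitive])
    # Strip the infinitive ending "-en" to obtain the stem.
    stem = infinitive[:-2] if infinitive.endswith("en") else infinitive
    # A set removes duplicates; sorting gives a stable order.
    return sorted({stem + ending for ending in WEAK_VERB_ENDINGS})


def moov_forms(infinitive, attested):
    """Return projected forms that do not occur in the attested corpus forms."""
    return [form for form in project_forms(infinitive) if form not in attested]


if __name__ == "__main__":
    corpus_forms = {"sagt"}  # assumed: the only form seen in a corpus
    print(project_forms("sagen"))
    print(moov_forms("sagen", corpus_forms))
```

In this sketch, the forms returned by `moov_forms` are the non-attested inflections that would otherwise be missing from a lexicon built only from corpus word forms. A real system would attach a pronunciation to each projected form; that step is omitted here.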