LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages |
Authors | Gibbon Dafydd (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de) Quirino Simões Ana Paula (CSLI, Stanford University, CA 94305-4115, USA, aquirino@stanford.edu) Matthiesen Martin (Lingsoft, Inc., Tehtaankatu 27-29 D, FIN-00150 Helsinki, Finland) |
Keywords | Finite State Technologies, Grapheme-Phoneme Conversion, Morphology, Morphophonology, Pronunciation, xfst |
Session | Session SP1 - Phonetic Issues and Speech Synthesis |
Full Paper | 251.ps, 251.pdf |
Abstract | We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, with a speed increase factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out-of-vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. A given spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) for a given stem lexicon size than with less highly inflecting languages (e.g. English) because of the 'morphological handicap' (ratio of stems to inflected word forms), which for German is about 1:5. However, the problem is worse for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be 'projected' using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need for defining pronunciation variants for inflected word forms; we also propose an efficient solution to this problem. |
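
The projection step described in the abstract (rule-based generation of inflected forms, with table lookup for irregular forms) can be illustrated with a minimal sketch. This is plain Python, not xfst and not a finite state transducer, and it is not the authors' implementation. The suffix list, the irregular table, the lexicon keys, and the function names `project_forms` and `find_moov_forms` are all hypothetical and chosen only for illustration.

```python
# Toy illustration of MOOV projection: generate inflected forms from
# lexicon keys and report those not attested in a corpus vocabulary.
# All data below is hypothetical and deliberately tiny.

# Assumed toy paradigm: present-tense suffixes for a regular weak German verb.
REGULAR_VERB_SUFFIXES = ["e", "st", "t", "en"]

# Assumed irregular table: forms are listed explicitly and bypass the rule.
IRREGULAR_FORMS = {
    "sein": ["bin", "bist", "ist", "sind", "seid"],
}


def project_forms(key):
    """Return inflected forms for a lexicon key: table lookup first, rule otherwise."""
    if key in IRREGULAR_FORMS:
        return list(IRREGULAR_FORMS[key])
    return [key + suffix for suffix in REGULAR_VERB_SUFFIXES]


def find_moov_forms(keys, attested):
    """Return projected forms that are missing from the attested vocabulary, sorted."""
    projected = {form for key in keys for form in project_forms(key)}
    return sorted(projected - set(attested))


if __name__ == "__main__":
    corpus_vocabulary = {"sage", "sagt", "ist"}
    print(find_moov_forms(["sag", "sein"], corpus_vocabulary))
```

In this toy setting, two lexicon entries project to nine word forms, of which only three are attested in the assumed corpus vocabulary; the remaining six are the non-attested forms that would need pronunciations generated for them.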