Title |
Word Segmentation in the Spoken Dutch Corpus |
Authors |
Jean-Pierre Martens (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium); Diana Binnenpoorte (Dept Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands); Kris Demuynck (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium); Ruben Van Parys (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium); Tom Laureys (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium); Wim Goedertier (ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium); Jacques Duchateau (ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium) |
Session |
SO7: Tools For Spoken LRs |
Abstract |
This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), and the procedures to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will only come with an automatically generated segmentation. Described are our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also mentions some figures concerning the manual verification of the first hundred thousand words. |
Keywords |
Word segmentation |
Full Paper |