LREC 2000 2nd International Conference on Language Resources & Evaluation
Title | Issues in Design and Collection of Large Telephone Speech Corpus for Slovenian Language |
Authors | Kačič Zdravko (Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, kacic@uni-mb.si); Horvat Bogomir (University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia, bogo.horvat@uni-mb.si); Zögling Aleksandra (University of Maribor, Research and Study Centre, Razlagova 22, 2000 Maribor, Slovenia, sandra.zogling@uni-mb.si) |
Keywords | Continuous Speech Recognition over the Telephone, Language Resources, Speech Databases, Speech Dictation Task |
Session | Session SP3 - Spoken Language Resources' Projects |
Full Paper | 246.ps, 246.pdf |
Abstract | This paper discusses issues in the design, collection, and evaluation of a large-vocabulary telephone speech corpus for the Slovenian language. The database is composed of three text corpora containing 1530 different sentences. It contains read speech from 82 speakers; each speaker read more than 200 sentences on average, and 21 speakers also read a text passage of 90 sentences. An initial manual segmentation and labeling of the speech material was performed, and automatic segmentation was then carried out on that basis. The database is intended to facilitate the development of speech recognition systems for dictation tasks over the telephone. So far, the database has been used mostly for isolated digit recognition and word spotting. |