Experience on the development of a language independent automatic Segmentation and labeling system on the frame of the BABEL project: A NEURAL NETWORK BASED SEGMENTATION TECHNIQUE FOR CONTINUOUS SPEECH RECOGNITION

K. Vicsi, A. Vig, G. Gordos

ABSTRACT
In the framework of the BABEL project, titled "A Multilingual Data-base Collection", the task was not only to collect a large speech corpus but also to segment continuously spoken paragraphs at the phonetic level. An automatic segmentation technique was needed to support this work. Some tools had already been prepared for this purpose, but they were generally built for a single language and/or based on the excellent but very expensive HTK program.

The aim of our work is to develop an automatic segmentation system that is usable for many European languages and gives substantial help in the segmentation and labelling of clear speech databases. To this end, different languages in the BABEL project and the EUROM 1 project were examined in order to find an optimal solution.

CONCEPT
In the BABEL project the principles of the SAMPA phonotypical transcription conventions were used [6]; that is, all partners use this transcription for their languages. Thus, if the phoneme set of a language is transcribed into the international SAMPA characters and the SAMPA transcriptions of the paragraphs are given, the automatic segmentation works and gives good results for English, German, Estonian, Bulgarian and Hungarian. The segmentation method uses so-called broad phonetic classification, which makes it possible to develop a system that works well for many languages.
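As a rough illustration of broad phonetic classification, the sketch below maps SAMPA symbols to a handful of broad classes. The class inventory, the symbol subsets and the name `to_broad_classes` are assumptions made for this example only; the actual classes used by the system are not listed here.

```python
# Hypothetical broad-class inventory (an assumption for illustration;
# the class set actually used by the system is not given in the text).
BROAD_CLASSES = {
    "vowel": {"i", "e", "a", "o", "u"},
    "resonant": {"m", "n", "l", "r", "j"},
    "spirant": {"f", "v", "s", "z", "S", "Z"},
    "stop": {"p", "b", "t", "d", "k", "g"},
}

# Invert the table once so each symbol can be looked up directly.
SYMBOL_TO_CLASS = {
    symbol: name
    for name, symbols in BROAD_CLASSES.items()
    for symbol in symbols
}


def to_broad_classes(sampa_symbols):
    """Map SAMPA symbols to broad class names; unlisted symbols become 'other'."""
    return [SYMBOL_TO_CLASS.get(symbol, "other") for symbol in sampa_symbols]
```

Because the classes are coarse and defined on SAMPA symbols rather than on language-specific phoneme sets, the same table could in principle be shared across languages, which is the property the text relies on.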

Our concept for automatic labelling is the following: if the labels of the examined sentences are known, and the acoustically quasi-homogeneous parts of those sentences can be segmented automatically, then each single segment can be labelled.
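The labelling step of this concept can be sketched as follows, assuming the segmenter returns exactly one boundary more than there are labels, so that labels and segments pair up one to one. The function name `label_segments` and the use of seconds for boundary times are assumptions for this example, not part of the described system.

```python
from typing import List, Tuple


def label_segments(
    labels: List[str], boundaries: List[float]
) -> List[Tuple[str, float, float]]:
    """Attach each known label to the segment between consecutive boundaries.

    Assumes len(boundaries) == len(labels) + 1, i.e. the automatic
    segmentation found exactly one segment per label.
    """
    if len(boundaries) != len(labels) + 1:
        raise ValueError("number of segments does not match number of labels")
    return [
        (label, boundaries[i], boundaries[i + 1])
        for i, label in enumerate(labels)
    ]
```

For example, `label_segments(["h", "E", "l"], [0.0, 0.08, 0.15, 0.21])` pairs the three labels with the three intervals in order.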

RESULTS OF THE AUTOMATIC SEGMENTATION
The hand-made and the automatic segmentations of the same sentences were compared with each other and are presented together, so the two results are directly comparable.

The following cases were examined:


Language                     Hungarian (H)           German (G)              English (E)             Bulgarian (B)
Type of training material    Hungarian   Mixed       German      Mixed       English     Mixed       Bulgarian   Mixed
                                         H_E_B                   H_E_B                   H_E_B                   H_E_B
resonant consonant           76          78          75          69          83          77          86          85
spirant consonant            88          87          96          92          95          97          94          96
all phonemes                 85          86          82          78          83          83          89          89

Table 1. The influence of the type of training material on the occurrence of automatic boundaries within ±25 ms of the hand-made ones, for several languages


The results were obtained by training the net on 4 paragraphs (2 women and 2 men, 20 sentences) and testing it on 4 other paragraphs (2 women and 2 men, 20 sentences) that differed from the training examples.
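The agreement measure named in the caption of Table 1 can be sketched as below, assuming that automatic and hand-made boundaries are paired one to one (the label sequence is the same in both), that times are given in milliseconds, and that the result is expressed as a percentage. The function name `boundary_agreement` and these conventions are assumptions for illustration, not the authors' evaluation code.

```python
def boundary_agreement(auto_ms, manual_ms, tolerance_ms=25):
    """Percentage of automatic boundaries within +/- tolerance_ms of the
    corresponding hand-made boundary.

    Assumes both lists are in the same order and have the same length,
    so that boundaries can be compared pairwise.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not manual_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tolerance_ms
    )
    return 100.0 * hits / len(manual_ms)
```

With hand-made boundaries at 100, 200, 300 and 400 ms and automatic boundaries at 110, 230, 290 and 425 ms, three of the four deviations (10, 30, 10 and 25 ms) fall within the tolerance, so the function returns 75.0.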
