ABSTRACT
Introduction
Implementing speech processing technology, such as speech recognition and
speaker verification, requires language-specific spoken language
resources: speech databases, lexica and related tools.
In order to be competitive with American companies, European companies
have to create an effective
infrastructure to deal successfully with their multilingual environment.
The EU-projects SpeechDat(M), SpeechDat(II), SpeechDat(Car) and ELRA are
part of such an infrastructure to create, validate and distribute spoken
language resources. These projects were focused on Western European languages.
In response to the fast-growing trade between Eastern and Western Europe, this
infrastructure has to be extended to Eastern European languages.
In this spirit, the proposed SpeechDat(E) project focuses on the
creation of spoken language resources for Eastern European languages,
namely Russian, Czech, and Slovak.
Project organization
The SpeechDat(E) project will be carried out within the COPERNICUS framework.
The project will run for 2 years, starting in 1998.
The SpeechDat(E) consortium consists of 3 industrial contractors and
3 academic contractors.
Siemens AG (Germany) acts as Project Coordinator, whereas AudiTech (Russia)
acts as Scientific Coordinator.
The project focuses on the following databases:
Content and creation
All databases are recorded on telephone servers with ISDN connections.
The signal format is 8-bit, 8 kHz A-law, the European ISDN standard.
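To illustrate what the 8-bit, 8 kHz A-law format means in practice, the following minimal sketch expands A-law bytes to 16-bit linear samples using the standard G.711 segment expansion. It assumes headerless signal files with one byte per sample; the function names are illustrative and not part of any SpeechDat tool.

```python
SAMPLE_RATE_HZ = 8000  # one A-law byte per sample at 8 kHz


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion used on the line
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a headerless A-law byte string to linear PCM samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a headerless A-law recording (assumption: no file header)."""
    return len(data) / SAMPLE_RATE_HZ
```

At this format one second of speech occupies 8000 bytes, i.e. 64 kbit/s, which is the rate of a single ISDN B channel.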
For the annotation, the SAM file format has been chosen for two reasons: it
separates signal from annotation data, and it is extensible. The
annotations are encoded in ISO-8859, and a common SAM file format has been
defined for each of the three types of databases. The file system hierarchy
is based on purely formal criteria, i.e. it is not content-related.
All SpeechDat databases can be addressed consistently in one large file
system. File names follow the 8.3 character pattern of ISO 9660 for
platform independence.
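As an illustration of the 8.3 constraint, the sketch below checks whether a file name fits the pattern. It assumes the restrictive character set usually associated with this pattern (upper-case letters, digits and underscore) and a non-empty extension; the example names are hypothetical and do not reflect the actual SpeechDat naming scheme.

```python
import re

# Assumption: 1-8 name characters, a dot, 1-3 extension characters,
# all drawn from A-Z, 0-9 and underscore.
EIGHT_DOT_THREE = re.compile(r"[A-Z0-9_]{1,8}\.[A-Z0-9_]{1,3}")


def is_8_3_name(file_name: str) -> bool:
    """Return True if file_name fits the 8.3 pattern under the assumptions above."""
    return EIGHT_DOT_THREE.fullmatch(file_name) is not None


def find_violations(file_names: list[str]) -> list[str]:
    """Return the names that do not fit the 8.3 pattern, in input order."""
    return [name for name in file_names if not is_8_3_name(name)]
```

A check of this kind could be run over a whole database tree before distribution, so that names which would be truncated or rejected on some platforms are caught early.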
There is a large core content common to all SpeechDat databases. It
consists of approximately 4 items that cover application words and phrases
like digit strings, and phonetically rich words and sentences.
The utterances will be annotated orthographically. Annotation is enriched
with a set of markers for noises and deviations like mispronunciations and
recording truncations.
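To show how such enriched annotations might be consumed, the following sketch recovers the plain word sequence from an annotated utterance. The marker syntax is an assumption made purely for illustration (square-bracketed lower-case tags for noises, a leading asterisk for a mispronounced word, a tilde for a truncated word); the actual marker inventory is not listed in this abstract.

```python
import re

# Hypothetical noise tags such as "[spk]"; the real tag set is not given here.
NOISE_TAG = re.compile(r"\[[a-z]+\]")


def plain_words(annotation: str) -> list[str]:
    """Strip the assumed noise tags and deviation markers, keeping the words."""
    text = NOISE_TAG.sub(" ", annotation)
    words = []
    for token in text.split():
        word = token.strip("*~")  # assumed mispronunciation / truncation markers
        if word:
            words.append(word)
    return words


def is_clean(annotation: str) -> bool:
    """True if the annotation carries none of the assumed markers."""
    return NOISE_TAG.search(annotation) is None and not any(
        c in annotation for c in "*~"
    )
```

Separating the markers from the orthography in this way lets a user choose between training on all utterances and training only on those without noises or deviations.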
Speaker recruitment is left to the individual partners.
Validation
The SpeechDat project features a thorough validation protocol. Whether the
databases meet their specifications is evaluated by an
independent validation centre, SPEX, which is an associated contractor of the
project.
Validation proceeds in three steps:
Current status
The SpeechDat(E) project was recently approved as a Joint Research Project of
the INCO-COPERNICUS Work Programme.
Meanwhile, the Russian partner has collected a speaker database
as an Invited (non-funded) Guest Partner of SpeechDat(II). Care will be
taken to avoid any speaker overlap with the SpeechDat(E)
recordings. This Russian database comprises speakers from Moscow
and from St. Petersburg. The database was completed according to
the specifications of the SpeechDat(II) project. The speech material
(answers to items) consists of spontaneous answers, read digit
sequences and text material (words and phrases). The total vocabulary is
about , units. Recording was carried out through ISDN lines.
The phoneme transcription for the lexicon follows the Russian SAMPA table,
which was developed according to the requirements of the St. Petersburg
phonetic school.
Outlook
The basic strategies for carrying out such a project to collect large
speech databases are currently being adopted by the SpeechDat(Car) project and
by the SALA project (SpeechDat Across Latin America), which collects Spanish
and Portuguese databases covering Latin American countries.