LREC 2010 Proceedings

Summary of the paper

Title	A Fully Annotated Corpus of Russian Speech
Authors	Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova and Vera Evdokimova
Abstract	The paper introduces CORPRES, a fully annotated Russian speech corpus developed at the Department of Phonetics, St. Petersburg State University, as a result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data, including labels for pitch marks, phonetic events, narrow and wide phonetic transcription, orthographic and prosodic transcription. Precise phonetic transcription of the data provides an especially valuable resource for both research and development purposes. The corpus contains 528 458 running words and 60 hours of speech, 7.5 hours from each speaker. 40% of the corpus was manually segmented and fully annotated on all six levels. The remaining 60% was partly annotated, with labels for pitch periods and phonetic events; orthographic, prosodic and ideal phonetic transcription for this part was generated and stored as text files. The fully annotated part covers all speaking styles and all speakers included in the corpus. The paper describes the CORPRES design and annotation principles, gives an overall description of the data and offers some speculation about possible uses of the corpus.
Topics	Corpus (creation, annotation, etc.), Phonetic Databases, Phonology, Speech Synthesis
Bibtex	@InProceedings{SKRELIN10.274,
  author = {Pavel Skrelin and Nina Volskaya and Daniil Kocharov and Karina Evgrafova and Olga Glotova and Vera Evdokimova},
  title = {A Fully Annotated Corpus of Russian Speech},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}