Title |
The C-ORAL-ROM CORPUS. A Multilingual Resource of Spontaneous Speech for Romance Languages. |
Author(s) |
Emanuela Cresti (1), Fernanda Bacelar do Nascimento (2), Antonio Moreno Sandoval (3), Jean Véronis (4), Philippe Martin (5), Khalid Choukri (6). (1) LABLITA, Dipartimento di Italianistica, Università di Firenze; (2) Centro de Linguística da Universidade de Lisboa; (3) Laboratorio de Lingüística Informática, Departamento de Lingüística, Universidad Autónoma de Madrid; (4) Description Linguistique Informatisée sur Corpus, Université de Provence; (5) Pitch Instruments France; (6) European Language Distribution Agency, European Language Association Agency (ELDA) |
Session |
P9-SE |
Abstract |
The C-ORAL-ROM project has delivered a multilingual corpus of spontaneous speech for the main Romance languages (Italian, French, Portuguese and Spanish). The collection aims to represent the variety of speech acts performed in everyday language and to enable the description of prosodic and syntactic structures in the four Romance languages. Sampling criteria are defined in a corpus design scheme. C-ORAL-ROM adopts two different sampling strategies, one for the formal part and one for the informal part: while a set of typical domains of application is selected to document the formal use of language, the informal part documents speech variation using parameters that refer to the event’s structure (dialogue vs. monologue) and the sociological domain of use (family-private vs. public). The four Romance corpora are tagged for terminal and non-terminal prosodic breaks. Terminal breaks are assumed to be the most relevant cues for identifying the linguistic domains of spontaneous speech (utterances). Relations with other concurrent criteria are discussed. The multimedia storage of the C-ORAL-ROM corpus is based on this principle: each textual string ending with a terminal break is aligned, through the WinPitch speech software, to its acoustic counterpart, generating the database of all utterances. |
Keyword(s) |
Spoken corpora, multilinguality, Romance languages, prosody, multimedia |
Language(s) |
Italian, French, Portuguese, Spanish |
Full Paper |