Title | Bilingual Connections for Trilingual Corpora: An XML Approach |
Author(s) |
Victoria Arranz, Núria Castell, Josep Maria Crego, Jesús Giménez, Adrià de Gispert, Patrik Lambert
TALP Research Center, Universitat Politècnica de Catalunya, Jordi Girona Salgado, 1-3, 08034 Barcelona, Spain. E-mail: {varranz, castell, jmcrego, jgimenez, agispert, lambert}@talp.upc.es |
Session | P18-S |
Abstract | This paper describes the design and development of a trilingual spontaneous speech corpus for statistical speech-to-speech translation. The languages considered are Catalan, Spanish and US-English. The corpus has been built bearing in mind the strong need, within the area of statistical translation, for multilingual collections of on-line data, as well as the need for data that are parallel or aligned, that contain different types of linguistic information and that can be used by different translation systems. For that reason, our aim has been the creation of a linguistically enriched resource with an XML-based DTD that allows useful, transparent and flexible storage of the data. Moreover, these resources are also valuable for a wide range of Natural Language Processing applications, such as multilingual resource acquisition or word sense discrimination, among others. |
Keyword(s) | Multilingual language resources, statistical machine translation, DTD, XML |
Language(s) | Catalan, Spanish, US-English |
Full Paper | 649.pdf |