Title | Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation |
Author(s) |
Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav Kuboň
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, 11800 Prague 1, Czech Republic |
Session | O36-SW |
Abstract | This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources: a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech and its annotation transformed into a dependency annotation scheme. The annotation of Czech was done automatically from plain text. A subset of corresponding Czech and English sentences has been annotated by humans. The resources being created at Charles University in Prague are scheduled for release as a Linguistic Data Consortium data collection in 2004. First experiments in Czech-English machine translation using these data have already been carried out. |
Keyword(s) | Parallel treebank, resources for machine translation, automatic parsing |
Language(s) | Czech, English |
Full Paper | 745.pdf |