Title |
Lexical token alignment: experiments, results and applications |
Authors |
Dan Tufiş (RACAI, 13 Septembrie, 13, Bucharest 1, Romania); Ana-Maria Barbu (RACAI, 13 Septembrie, 13, Bucharest 1, Romania) |
Session |
WP1: Corpora & Corpus Tools |
Abstract |
Lexical alignment is one of the most challenging tasks in processing and exploiting parallel texts. Numerous applications may benefit from an accurate lexical alignment of bilingual and multilingual corpora. This paper describes a hypothesis-testing approach to the automatic extraction of translation equivalents from sentence-aligned and tagged parallel corpora. The algorithm was used for the automatic extraction of 6 bilingual lexicons with English as the source language and Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene as the target languages, as well as a 7-language lexicon with English as a hub and the other 6 CEE languages. For the experiments described here we used the 7-language aligned corpus based on Orwell's novel "1984". |
Keywords |
Lexical token alignment |
Full Paper |