Title

Using Parallel Corpora to enrich Multilingual Lexical Resources

Authors

Dominic Widdows (Center for the Study of Language and Information, Stanford University, California)

Beate Dorow (Center for the Study of Language and Information, Stanford University, California)

Chiu-Ki Chan (Center for the Study of Language and Information, Stanford University, California)

Session

WO3: Acquisition Of Lexical Information

Abstract

This paper describes the use of a bilingual vector model for the automatic discovery of German translations of English terms. The model is built by analysing co-occurrence patterns in a parallel corpus of English and German medical abstracts, a method also used for Cross-Lingual Information Retrieval. The model generates candidate German translations of English words using the cosine similarity measure between terms in the bilingual vector space. The correct translations could be added to UMLS, the multilingual dictionary in question. The accuracy of the translations is evaluated by measuring how many of the existing UMLS translations are correctly predicted by the vector translations. The model also detects synonymy, particularly acronyms. An online public demonstration of the model is available.
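
As a rough illustration of the idea in the abstract, the following minimal sketch ranks German candidates for an English term by cosine similarity of co-occurrence vectors. It is not the authors' implementation: the toy corpus, the function names (`build_vectors`, `cosine`, `candidate_translations`) and the choice of raw counts over aligned document pairs as vector coordinates are all assumptions made for the example, and any weighting or dimensionality reduction used in the actual model is left out.

```python
import math
from collections import Counter, defaultdict

# Toy aligned "abstracts", invented for illustration (not from the paper's corpus).
PARALLEL_DOCS = [
    ("heart disease therapy", "herz krankheit therapie"),
    ("heart surgery", "herz chirurgie"),
    ("liver disease", "leber krankheit"),
    ("liver surgery therapy", "leber chirurgie therapie"),
]


def build_vectors(docs):
    """Map each (language, term) to sparse counts over aligned document pairs."""
    vectors = defaultdict(Counter)
    for doc_id, (english, german) in enumerate(docs):
        for term in english.split():
            vectors[("en", term)][doc_id] += 1
        for term in german.split():
            vectors[("de", term)][doc_id] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_translations(term, vectors, top_n=3):
    """Return the top_n German terms closest to the given English term."""
    source = vectors.get(("en", term), Counter())
    scored = [
        (cosine(source, vec), word)
        for (lang, word), vec in vectors.items()
        if lang == "de"
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, round(score, 3)) for score, word in scored[:top_n]]


if __name__ == "__main__":
    vectors = build_vectors(PARALLEL_DOCS)
    for english_term in ("heart", "liver", "disease"):
        print(english_term, candidate_translations(english_term, vectors))
```

In this toy setting an English term and its German counterpart occur in exactly the same document pairs, so the correct translation ranks first. On a real corpus the top-ranked candidates would still need to be checked, for instance against existing dictionary entries, as the evaluation described above does.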

Keywords

Multilingual lexical resources

Full Paper

103.pdf