Title	Bootstrapping a database of German multi-word expressions
Author(s)	Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften, Jägerstr. 22/23, 10117 Berlin, www.dwds.de, geyken@bbaw.de)
Session	O24-TW
Abstract	We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann 1993) using an inductive description of POS sequences in conjunction with a Brill Tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification resulted in a database of multi-word expressions where each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, addresses the association of a sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite state transducer from the sequences for each entry and apply a finite state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression.
Keyword(s)	multi-word expressions, collocations, database, acquisition, finite state filter
Language(s)	German
Full Paper	595.pdf
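
Illustration	The longest-match filtering described in the abstract can be sketched as follows. This is a minimal sketch, not the paper's implementation: a plain Python scan stands in for the weighted finite state transducer, weights are omitted, matching is contiguous only, and the entries, the STTS-like tags, the example sentence, and the names `matches_at` and `longest_matches` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One pattern element: (POS tag, token); a token of None means
# "any token carrying this tag" (a hypothetical wildcard slot).
Element = Tuple[str, Optional[str]]
# One corpus position: (token, POS tag), as in a tagged corpus.
Tagged = Tuple[str, str]


def matches_at(pattern: List[Element], sent: List[Tagged], start: int) -> bool:
    """True if every pattern element matches the sentence from `start` on."""
    if not pattern or start + len(pattern) > len(sent):
        return False
    for (tag, tok), (word, word_tag) in zip(pattern, sent[start:]):
        if tag != word_tag:
            return False
        if tok is not None and tok.lower() != word.lower():
            return False
    return True


def longest_matches(
    entries: Dict[str, List[Element]], sent: List[Tagged]
) -> List[Tuple[str, int, int]]:
    """Scan left to right; at each position keep the longest matching entry.

    Returns (entry name, start index, end index) with an exclusive end.
    """
    hits = []
    i = 0
    while i < len(sent):
        best = None
        for name, pattern in entries.items():
            if matches_at(pattern, sent, i):
                if best is None or len(pattern) > len(best[1]):
                    best = (name, pattern)
        if best is None:
            i += 1
        else:
            hits.append((best[0], i, i + len(best[1])))
            i += len(best[1])  # continue after the matched span
    return hits


# Hypothetical database entries, invented for this sketch.
ENTRIES = {
    "in Kauf nehmen": [("APPR", "in"), ("NN", "Kauf"), ("VVINF", "nehmen")],
    "in Kauf": [("APPR", "in"), ("NN", "Kauf")],
}

# Hypothetical tagged corpus sentence.
SENTENCE = [
    ("Das", "PDS"), ("muss", "VMFIN"), ("man", "PIS"),
    ("in", "APPR"), ("Kauf", "NN"), ("nehmen", "VVINF"), (".", "$."),
]

print(longest_matches(ENTRIES, SENTENCE))
```

Both entries match at the token "in", and the scan keeps the three-token entry because it covers the longer span. A real filter for German would also need to handle inflected and discontinuous realisations, which this sketch does not attempt.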