Title |
Bootstrapping a database of German multi-word expressions |
Author(s) |
Alexander Geyken, Berlin-Brandenburgische Akademie der Wissenschaften, Jägerstr. 22/23, 10117 Berlin, www.dwds.de, geyken@bbaw.de |
Session |
O24-TW |
Abstract |
We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann 1993) using an inductive description of POS sequences in conjunction with a Brill tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification resulted in a database of multi-word expressions where each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, addresses the association of a sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite state transducer from the sequences for each entry and apply a finite state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression. |
Keyword(s) |
multi-word expressions, collocations, database, acquisition, finite state filter |
Language(s) | German |
Full Paper |
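The longest-match filtering described in the abstract can be illustrated with a minimal sketch. It is not the weighted finite state transducer the abstract refers to: it swaps in plain Python list matching to show only the longest-match idea. The STTS-style tags, the two database entries, the corpus sentence, and the names `matches_at` and `longest_matches` are all assumptions made for illustration. The sketch also assumes contiguous, uninflected occurrences, which real German corpus data often violates.

```python
from typing import Dict, List, Optional, Tuple

# A database entry is a sequence of (POS tag, token) pairs.
# A token of None means "any token with this tag" (illustrative assumption).
Pair = Tuple[str, Optional[str]]
Tagged = Tuple[str, str]  # (POS tag, token) as found in a tagged corpus


def matches_at(pattern: List[Pair], corpus: List[Tagged], start: int) -> bool:
    """Return True if `pattern` matches the corpus contiguously at `start`."""
    if start + len(pattern) > len(corpus):
        return False
    for (p_tag, p_tok), (c_tag, c_tok) in zip(pattern, corpus[start:]):
        if p_tag != c_tag:
            return False
        if p_tok is not None and p_tok.lower() != c_tok.lower():
            return False
    return True


def longest_matches(
    entries: Dict[str, List[Pair]], corpus: List[Tagged]
) -> List[Tuple[int, int, str]]:
    """Scan left to right; at each position keep only the longest matching entry.

    Returns (start, end, entry_name) triples with `end` exclusive.
    """
    results: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(corpus):
        best_name: Optional[str] = None
        best_len = 0
        for name, pattern in entries.items():
            if len(pattern) > best_len and matches_at(pattern, corpus, i):
                best_name, best_len = name, len(pattern)
        if best_name is not None:
            results.append((i, i + best_len, best_name))
            i += best_len  # skip past the matched span
        else:
            i += 1
    return results


# Hypothetical entries: the shorter one is a prefix of the longer one.
ENTRIES: Dict[str, List[Pair]] = {
    "ins Gras": [("APPRART", "ins"), ("NN", "Gras")],
    "ins Gras beißen": [("APPRART", "ins"), ("NN", "Gras"), ("VVINF", "beißen")],
}

# Hypothetical tagged sentence: "er musste ins Gras beißen"
CORPUS: List[Tagged] = [
    ("PPER", "er"),
    ("VMFIN", "musste"),
    ("APPRART", "ins"),
    ("NN", "Gras"),
    ("VVINF", "beißen"),
]

if __name__ == "__main__":
    for start, end, name in longest_matches(ENTRIES, CORPUS):
        print(start, end, name)
```

Both hypothetical entries match at position 2, and the filter keeps only the three-token one. A transducer-based filter would additionally allow weights, inflected forms, and intervening material, none of which this sketch attempts.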