Title

Comparative Evaluation of Collocation Extraction Metrics

Authors

Aristomenis Thanopoulos (Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 265 00 Rion, Patras, Greece)

Nikos Fakotakis (Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 265 00 Rion, Patras, Greece)

George Kokkinakis (Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 265 00 Rion, Patras, Greece)

Session

EP1: Evaluation

Abstract

Corpus-based automatic extraction of collocations is typically carried out using a statistic that indicates co-occurrence, in order to identify words that co-occur more often than expected by chance. In this paper we examine several typical measures, namely the t-score, Pearson’s chi-square test, the log-likelihood ratio and pointwise mutual information, together with a novel information-theoretic measure, mutual dependency. Apart from some theoretical discussion of their correlation, we perform comparative evaluation experiments, judging performance by the measures' ability to identify lexically associated bigrams. We use two different gold standards: WordNet and lists of named entities. Besides finding that a frequency-biased version of mutual dependency performs best, followed closely by the log-likelihood ratio, we point out some implications of using available electronic dictionaries such as WordNet for the evaluation of collocation extraction.
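
As an illustration of the standard measures named in the abstract, the following minimal sketch scores a single bigram from raw corpus counts using their common textbook formulations. It is not taken from the paper: the function name `association_scores` and its parameters are hypothetical, the formulas may differ in detail from the variants the authors evaluated, and mutual dependency is omitted because the abstract does not define it. The sketch assumes a non-zero bigram count and non-degenerate marginals (every expected cell count is positive).

```python
import math


def association_scores(c12, c1, c2, n):
    """Score one bigram (w1, w2) from counts (hypothetical helper).

    c12: occurrences of the bigram (w1, w2), assumed > 0
    c1:  bigrams whose first word is w1
    c2:  bigrams whose second word is w2
    n:   total number of bigrams in the corpus
    """
    # Bigram count expected if w1 and w2 were independent.
    expected = c1 * c2 / n

    # Pointwise mutual information: log2 of observed over expected.
    pmi = math.log2(c12 * n / (c1 * c2))

    # t-score, approximating the sample variance by the observed count.
    t_score = (c12 - expected) / math.sqrt(c12)

    # 2x2 contingency table: rows are w1 / not w1, columns are w2 / not w2.
    observed = [[c12, c1 - c12],
                [c2 - c12, n - c1 - c2 + c12]]
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]

    chi_square = 0.0
    log_likelihood = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n
            chi_square += (o - e) ** 2 / e
            if o > 0:  # an empty cell contributes nothing to the sum
                log_likelihood += o * math.log(o / e)
    log_likelihood *= 2.0

    return {
        "pmi": pmi,
        "t_score": t_score,
        "chi_square": chi_square,
        "log_likelihood": log_likelihood,
    }


# Illustrative counts only, not data from the paper.
scores = association_scores(c12=20, c1=30, c2=40, n=10000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Ranking all candidate bigrams by one of these scores and comparing the top of the list against a gold standard is the kind of comparison the abstract describes. When the observed count equals the count expected under independence, all four scores are zero.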

Keywords

Collocation extraction, Automatic evaluation, WordNet

Full Paper

128.pdf