Title

Measuring corpus homogeneity using a range of measures for inter-document distance

Authors

Gabriela Cavaglia (ITRI, University of Brighton, Lewes Road, Brighton BN2 4GJ, United Kingdom)

Session

WP1: Corpora & Corpus Tools

Abstract

With the ever more widespread use of corpora in language research, it is becoming increasingly important to be able to describe and compare corpora. Analysing corpus homogeneity is a prerequisite for any quantitative approach to corpus comparison. We describe a method for text analysis based only on document-internal linguistic features, and a set of related homogeneity measures based on inter-document distance. We present a preliminary experiment to test the hypothesis that training an NLP system requires a smaller subcorpus when the corpus is homogeneous than when it is heterogeneous.
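
As an illustration only, a homogeneity measure based on inter-document distance might be sketched as follows. The abstract does not specify the features or the distance used, so everything here is an assumption: relative word frequencies stand in for the document-internal features, Jensen-Shannon divergence stands in for the distance, and the names `profile`, `distance` and `homogeneity` are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def profile(text):
    """Relative word frequencies of one document (assumed feature set)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def distance(p, q):
    """Jensen-Shannon divergence in bits (assumed distance).

    0 means identical profiles; 1 means no shared vocabulary.
    """
    js = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = (pw + qw) / 2
        if pw:
            js += 0.5 * pw * math.log2(pw / mean)
        if qw:
            js += 0.5 * qw * math.log2(qw / mean)
    return js


def homogeneity(docs):
    """Mean pairwise inter-document distance; lower means more homogeneous."""
    profiles = [profile(d) for d in docs]
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(distance(p, q) for p, q in pairs) / len(pairs)
```

Under these assumptions, a corpus whose documents share most of their vocabulary scores close to 0, while a corpus of unrelated documents scores close to 1.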

Keywords

Corpus homogeneity, Corpus design, Corpus comparison

Full Paper

232.pdf