Title	Frequent Term Distribution Measures for Dataset Profiling
Author(s)	Anne De Roeck, Avik Sarkar, Paul Garthwaite (The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK)
Session	O39-EW
Abstract	We motivate the need for dataset profiling in the context of evaluation, and show that textual datasets differ in ways that challenge assumptions about the applicability of techniques. We set out some criteria for useful profiling measures. We argue that distribution patterns of frequent words are useful in profiling genre, and report on a series of experiments with χ²-based measures on the TIPSTER collection, and on textual intranet data. Findings show substantial differences in the distribution of very frequent terms across datasets.
Keyword(s)	Homogeneity measures, term distribution, dataset profiling, evaluation
Language(s)	English
Full Paper	629.pdf