LREC 2006 - Proceedings sorted by papers

Title	Finding representative sets of dialect words for geographical regions
Authors	M. Salmenkivi
Abstract	We investigate a corpus of geographical distributions of 17,126 Finnish dialect words. Our goal is to automatically find sets of words characteristic to geographical regions. Though our approach is related to the problem of dividing the investigation area into linguistically (and geographically) relatively coherent dialect regions, we do not aim at constructing more or less questionable dialect regions. Instead, we let the boundaries of the regions overlap to get insight to the degree of lexical change between adjacent areas. More concretely, we study the applicability of data clustering approaches to find sets of words with tight spatial distributions, and to cluster the extracted distributions according to their distribution areas. The extracted words belonging to the same cluster can then be utilized as a means to characterize the lexicon of the region. We also automatically pick up words with occurrences appearing in two or more areas that are geographically far from each other. These words may give valuable insight to, e.g., the study of cultural history and history of settlement.
Keywords	dialectometry, dialect, lexical variation, clustering
Full paper	Finding representative sets of dialect words for geographical regions

Title

Finding representative sets of dialect words for geographical regions

Authors

M. Salmenkivi

Abstract

We investigate a corpus of geographical distributions of 17,126 Finnish dialect words. Our goal is to automatically find sets of words characteristic to geographical regions. Though our approach is related to the problem of dividing the investigation area into linguistically (and geographically) relatively coherent dialect regions, we do not aim at constructing more or less questionable dialect regions. Instead, we let the boundaries of the regions overlap to get insight to the degree of lexical change between adjacent areas. More concretely, we study the applicability of data clustering approaches to find sets of words with tight spatial distributions, and to cluster the extracted distributions according to their distribution areas. The extracted words belonging to the same cluster can then be utilized as a means to characterize the lexicon of the region. We also automatically pick up words with occurrences appearing in two or more areas that are geographically far from each other. These words may give valuable insight to, e.g., the study of cultural history and history of settlement.

Keywords

dialectometry, dialect, lexical variation, clustering

Full paper

Finding representative sets of dialect words for geographical regions