Title |
The American National Corpus: More Than the Web Can Provide |
Authors |
Nancy Ide (Department of Computer Science, Vassar College, Poughkeepsie, New York 12604-0520, USA); Randi Reppen (Department of English, Northern Arizona University, Flagstaff, Arizona, USA); Keith Suderman (Department of Computer Science, Vassar College, Poughkeepsie, New York 12604-0520, USA) |
Session |
WO8: Written Corpora |
Abstract |
The American National Corpus (ANC) project is developing a corpus of American English comparable to the British National Corpus (BNC). Recent interest in the web as a source of corpus materials has led some in the language processing community to suggest that developing a corpus of American English is unnecessary. We argue that, far from being rendered superfluous by the availability of web materials, the ANC is likely to serve as a resource for developing web acquisition techniques that support tasks such as genre detection, language detection, and automatic annotation. This paper compares the ANC, in both content and format, with a test corpus compiled from web data, and discusses the points of intersection and divergence. |
Keywords |
Corpus building, World Wide Web |
Full Paper |