Title

Mining the Web for Discourse Markers

Author(s)

Ben Hutchinson

University of Edinburgh

Session

P5-W

Abstract

This paper proposes a methodology for obtaining sentences containing discourse markers from the World Wide Web. The methodology is particularly suitable for collecting large numbers of discourse marker tokens. It relies on the automatic identification of discourse markers, and we show that this can be done with an accuracy within 9% of human performance. We also show that the distribution of discourse markers on the web correlates highly with that in a conventional balanced corpus.

Keyword(s)

World Wide Web, Discourse Markers

Language(s)

English

Full Paper

333.pdf