Title | Pumping Documents Through a Domain and Genre Classification Pipeline |
Author(s) |
Udo Hahn, Joachim Wermter
Text Knowledge Engineering Lab, Freiburg University, Werthmannplatz 1, D-79098 Freiburg, Germany |
Session | O16-EW |
Abstract | We propose a simple, yet effective, pipeline architecture for document classification. The task we intend to solve is to classify large and content-wise heterogeneous document streams into a layered nine-category system, which distinguishes medical from non-medical texts and sorts medical texts into various subgenres. While the document classification problem is often dealt with using computationally powerful and, hence, costly classifiers (e.g., Bayesian ones), we have gathered empirical evidence that a much simpler approach based on n-gram statistics achieves a comparable level of classification performance. |
Keyword(s) | text categorization, medical application, n-gram model, text genre, WWW |
Language(s) | German, language-independent |
Full Paper | 641.pdf |