Title |
There's no Data like More Data? Revisiting the Impact of Data Size on a Classification Task |
Authors |
Ines Rehbein and Josef Ruppenhofer |
Abstract |
In this paper we investigate the impact of data size on a Word Sense Disambiguation (WSD) task. We question the assumption that the knowledge acquisition bottleneck, which is known as one of the major challenges for WSD, can be solved by simply obtaining more and more training data. Our case study on 1,000 manually annotated instances of the German verb "drohen" (threaten) shows that the best performance is obtained not by training on the full data set, but by carefully selecting new training instances with regard to their informativeness for the learning process (Active Learning). We present a thorough evaluation of the impact of different sampling methods on the data sets and propose an improved method for uncertainty sampling which dynamically adapts the selection of new instances to the learning progress of the classifier, resulting in more robust results during the initial stages of learning. A qualitative error analysis identifies problems for automatic WSD and discusses the reasons for the large gap in performance between human annotators and our automatic WSD system. |
Topics |
Word Sense Disambiguation; Tools, systems, applications; Statistical and machine learning methods |
Full paper |
There's no Data like More Data? Revisiting the Impact of Data Size on a Classification Task |
Slides |
- |
Bibtex |
@InProceedings{REHBEIN10.806,
  author = {Ines Rehbein and Josef Ruppenhofer},
  title = {There's no Data like More Data? Revisiting the Impact of Data Size on a Classification Task},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |