Title
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Authors
Hai Zhao, Yan Song and Chunyu Kit
Abstract
We investigate the impact of input data scale on corpus-based learning, in the style of a Zipf's law study. Chinese word segmentation is chosen as the case study, and a series of experiments is conducted for it, examining two types of segmentation techniques: statistical learning and rule-based methods. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. For the rule-based method, an approximately negative inverse relationship between performance and the size of the input lexicon is observed.
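Read as scaling relationships, the two findings can be sketched as follows (our paraphrase, not formulas from the paper; here F denotes segmentation performance, N the training corpus size, |V| the input lexicon size, and a, b, c, d fitted constants):

% Statistical learning: a linear gain in F requires (at least)
% exponentially more training data, i.e. roughly logarithmic scaling:
F_{\mathrm{stat}}(N) \approx a + b \log N
% Rule-based method: performance and lexicon size are related by
% an approximate negative inverse:
F_{\mathrm{rule}}(|V|) \approx c - \frac{d}{|V|}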
Topics
Corpus (creation, annotation, etc.), Statistical and machine learning methods
Full paper
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Slides
How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Bibtex
@InProceedings{ZHAO10.199,
  author = {Hai Zhao and Yan Song and Chunyu Kit},
  title = {How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
}