Summary of the paper

Title: How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Authors: Hai Zhao, Yan Song and Chunyu Kit
Abstract: We investigate the impact of input data scale on corpus-based learning, in the style of studies of Zipf's law. Chinese word segmentation is chosen as the case study, and a series of experiments is conducted for it, examining two types of segmentation techniques: statistical learning and rule-based methods. The empirical results show that a linear performance improvement in statistical learning requires at least an exponential increase in training corpus size. For the rule-based method, an approximately negative inverse relationship between performance and the size of the input lexicon can be observed.
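
The two scaling relationships described in the abstract can be made concrete with a minimal sketch. The Python snippet below fits the implied functional forms, F ≈ a + b·ln(n) for the statistical learner and F ≈ c - d/m for the rule-based segmenter, to made-up data points; all sizes, F-scores, and fitted coefficients here are hypothetical illustrations, not the paper's experimental results.

import numpy as np
from scipy.optimize import curve_fit

# Statistical learner: a linear gain in F-score requires roughly
# exponential growth in training corpus size, i.e. F ~ a + b * ln(n).
# Data points are invented for illustration only.
corpus_sizes = np.array([1e4, 1e5, 1e6, 1e7])      # training tokens (hypothetical)
f_statistical = np.array([0.82, 0.88, 0.93, 0.96])  # F-scores (hypothetical)

def log_law(n, a, b):
    return a + b * np.log(n)

(a, b), _ = curve_fit(log_law, corpus_sizes, f_statistical)
print(f"statistical: F ~= {a:.3f} + {b:.3f} * ln(n)")

# Rule-based segmenter: an approximate negative inverse relationship
# between F-score and lexicon size, i.e. F ~ c - d / m.
lexicon_sizes = np.array([1e3, 5e3, 1e4, 5e4])      # lexicon entries (hypothetical)
f_rule_based = np.array([0.70, 0.85, 0.88, 0.90])   # F-scores (hypothetical)

def inverse_law(m, c, d):
    return c - d / m

(c, d), _ = curve_fit(inverse_law, lexicon_sizes, f_rule_based)
print(f"rule-based: F ~= {c:.3f} - {d:.1f} / m")

Both fits are linear in their parameters, so curve_fit converges without an explicit initial guess; the point is only to show the shape of each law, not to reproduce the paper's numbers.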
Topics: Corpus (creation, annotation, etc.), Statistical and machine learning methods
Full paper: How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
Slides: How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method
BibTeX:
@InProceedings{ZHAO10.199,
  author = {Hai Zhao and Yan Song and Chunyu Kit},
  title = {How Large a Corpus Do We Need: Statistical Method Versus Rule-based Method},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }