Summary of the paper

Title The Influence of Corpus Quality on Statistical Measurements on Language Resources
Authors Thomas Eckart, Uwe Quasthoff and Dirk Goldhahn
Abstract The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation the influence of different definitions for the concepts "word" and "sentence" is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality as well. As an example near duplicate sentences are identified as source of many statistical irregularities. The problem of strongly varying results especially holds for Web corpora with a large set of pre-processing steps. Here, a well-defined and language independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing and therefore such measurements can help to improve corpus quality.
Topics Corpus (creation, annotation, etc.), Evaluation methodologies, Validation of LRs
Full paper The Influence of Corpus Quality on Statistical Measurements on Language Resources
Bibtex @InProceedings{ECKART12.476,
  author = {Thomas Eckart and Uwe Quasthoff and Dirk Goldhahn},
  title = {The Influence of Corpus Quality on Statistical Measurements on Language Resources},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }