We introduce a framework for quality assurance of corpora and apply it to the Reuters Multilingual Corpus (RCV2). Our assessment of this standard newsprint corpus reveals a significant duplication problem and, to a lesser extent, a problem with corrupted articles. Of the roughly 487,000 articles in the raw collection, almost one tenth are trivial duplicates. A smaller fraction of articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as online appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance of corpora. As a first implementation of this framework, we have investigated constraints that verify sample integrity and diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest.
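The abstract names two signals for spotting near-duplicates, entropy measurements and a byte bigram incidence digest, without defining them. The following is a minimal illustrative sketch of one plausible reading, not the paper's implementation. It assumes that "entropy" means Shannon entropy over byte frequencies, that the "incidence digest" is the set of distinct adjacent byte pairs in an article, and that two digests are compared by Jaccard similarity. The function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (assumed measure)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bigram_incidence(data: bytes) -> frozenset:
    """Set of distinct adjacent byte pairs: presence only, counts are discarded."""
    return frozenset(zip(data, data[1:]))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two digests; two empty digests count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_near_duplicate(x: bytes, y: bytes, threshold: float = 0.9) -> bool:
    """Flag a pair whose bigram digests overlap at least `threshold` (hypothetical cutoff)."""
    return jaccard(bigram_incidence(x), bigram_incidence(y)) >= threshold
```

Under these assumptions, an article whose byte entropy is far from the typical value for its language would be a candidate for the corruption check, while pairs with high digest overlap would be candidates for near-duplicate review.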
@InProceedings{ERIKSSON16.214,
  author    = {Robin Eriksson},
  title     = {Quality Assessment of the Reuters Vol. 2 Multilingual Corpus},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year      = {2016},
  month     = {may},
  date      = {23-28},
  location  = {Portorož, Slovenia},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {978-2-9517408-9-1},
  language  = {english}
}