Title |
Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus |
Authors |
Martin Reynaert, Nelleke Oostdijk, Orphée De Clercq, Henk van den Heuvel and Franciska de Jong |
Abstract |
In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat) the problem is even more complex: issues such as anonymity, recognizability and citation rights all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and a smaller, more privacy-compliant SoNaR of commissioned size, IPR-cleared for commercial purposes as well. |
Topics |
Corpus (creation, annotation, etc.), Acquisition, Discourse annotation, representation and processing |
Full paper |
Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus |
Slides |
- |
Bibtex |
@InProceedings{REYNAERT10.549,
  author = {Martin Reynaert and Nelleke Oostdijk and Orphée De Clercq and Henk van den Heuvel and Franciska de Jong},
  title = {Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |