LREC 2012 Proceedings

Summary of the paper

Title	The WeSearch Corpus, Treebank, and Treecache -- A Comprehensive Sample of User-Generated Content
Authors	Jonathon Read, Dan Flickinger, Rebecca Dridan, Stephan Oepen and Lilja Øvrelid
Abstract	We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We describe the format of the syntacto-semantic annotations found in this resource and report initial parsing results for these data, as well as some reflections following a first round of treebanking.
Topics	Corpus (creation, annotation, etc.), Parsing, Information Extraction, Information Retrieval
Bibtex	@InProceedings{READ12.774,
  author = {Jonathon Read and Dan Flickinger and Rebecca Dridan and Stephan Oepen and Lilja Øvrelid},
  title = {The WeSearch Corpus, Treebank, and Treecache -- A Comprehensive Sample of User-Generated Content},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}