Summary of the paper

Title The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language
Authors Mike Kestemont, Claudia Peersman, Benny De Decker, Guy De Pauw, Kim Luyckx, Roser Morante, Frederik Vaassen, Janneke van de Loo and Walter Daelemans
Abstract Although in recent years numerous forms of Internet communication ― such as e-mail, blogs, chat rooms and social network environments ― have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required for e.g. automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this ‘anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.
Topics Corpus (creation, annotation, etc.), Document Classification, Text categorisation, Morphology
Full paper The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language
Bibtex @InProceedings{KESTEMONT12.938,
  author = {Mike Kestemont and Claudia Peersman and Benny De Decker and Guy De Pauw and Kim Luyckx and Roser Morante and Frederik Vaassen and Janneke van de Loo and Walter Daelemans},
  title = {The Netlog Corpus. A Resource for the Study of Flemish Dutch Internet Language},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }
Powered by ELDA © 2012 ELDA/ELRA