Summary of the paper

Title Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Authors Maite Melero, Marta R. Costa-Jussà, Judith Domingo, Montse Marquina and Martí Quixal
Abstract We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires the revisiting of the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
Topics Semantic Web; Tools, systems, applications; Authoring tools, proofing
Bibtex @InProceedings{MELERO12.627,
  author = {Maite Melero and Marta R. Costa-Jussà and Judith Domingo and Montse Marquina and Martí Quixal},
  title = {Holaaa!! writin like u talk is kewl but kinda hard 4 NLP},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {May},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}