Title |
Holaaa!! writin like u talk is kewl but kinda hard 4 NLP |
Authors |
Maite Melero, Marta R. Costa-Jussà, Judith Domingo, Montse Marquina and Martí Quixal |
Abstract |
We present work in progress on building tools for the normalization of User-Generated Content (UGC). The task requires revisiting the initial steps of the NLP pipeline, since UGC (micro-blogs, blogs and, more generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of Spanish UGC text from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Finally, we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker. |
Topics |
Semantic Web; Tools, systems, applications; Authoring tools, proofing |
Full paper |
Holaaa!! writin like u talk is kewl but kinda hard 4 NLP |
Bibtex |
@InProceedings{MELERO12.627,
  author = {Maite Melero and Marta R. Costa-Jussà and Judith Domingo and Montse Marquina and Martí Quixal},
  title = {Holaaa!! writin like u talk is kewl but kinda hard 4 NLP},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |