In recent years, research on Natural Language Processing (NLP) for the Arabic language has attracted significant attention. Almost all written Arabic is in Modern Standard Arabic (MSA), because Arabic speakers write in MSA in all formal situations; the exception is informal settings such as social media. Social media is therefore a particularly good resource for collecting Arabic dialect text for NLP research. The scarcity of Arabic dialect corpora, compared with what is available for dialects of English and other languages, shows the need to create dialect corpora for use in Arabic dialect processing. The objective of this work is to build an Arabic dialect text corpus from Twitter and from online comments on newspapers and Facebook, and then to develop a crowdsourcing approach for annotating the text with the correct dialect tags before any NLP step. The annotation task was developed as an online game in which players test their dialect classification skills and receive a score reflecting their knowledge. We collected 200K tweets, 10K newspaper comments, and 2M Facebook comments, totalling 13,876,504 words, from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This annotation approach has so far yielded 24K annotated documents, 587,549 tokens; 16,179 tagged as dialect and 7,821 as MSA, with the total number of tokens equal to 487,549. This paper explores Twitter, Facebook, and online newspapers as sources of Arabic dialect text, and describes the methods used to extract tweets and comments and then classify them into dialect groups according to the geographic location of the sender and the country of the newspaper or Facebook page. It also describes the annotation approach used to tag every tweet and comment.
@InProceedings{ALSHUTAYRI18.19,
  author    = {Areej Alshutayri and Eric Atwell},
  title     = {Creating an Arabic Dialect Text Corpus by Exploring Twitter, Facebook, and Online Newspapers},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Hend Al-Khalifa and Walid Magdy and Kareem Darwish and Tamer Elsayed},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-25-2},
  language  = {english}
}