Modern Standard Arabic (MSA) is the official language used in formal communications, while Dialectal Arabic (DA) refers to the spoken languages of different Arab countries and regions, which are widely used on social media for daily communication. DA differs from MSA at almost all levels, and resources for DA are very limited compared to MSA. In this paper, we present the first and largest corpus of dialectal tweets with translations to MSA, provided by a large number of native speakers through crowdsourcing. We describe how we collected the tweets, annotated them, and measured translation quality. This corpus supports research on understanding and quantifying differences between DA and MSA, dialect identification, converting DA to MSA (and hence using MSA resources), and machine translation, among other applications. The corpus contains roughly 5,500 and 5,000 tweets written in the Egyptian and Maghrebi dialects with verified MSA translations (16,000 and 8,000, respectively), and 6,000 tweets written in each of the Levantine and Gulf dialects with MSA translations before verification (18,000 for each). We make the corpus freely available for research purposes.
Editors: Hend Al-Khalifa (King Saud University, KSA), Walid Magdy (University of Edinburgh, UK), Kareem Darwish (Qatar Computing Research Institute, Qatar), Tamer Elsayed (Qatar University, Qatar)
@InProceedings{MUBARAK18.13,
  author    = {Hamdy Mubarak},
  title     = {Converting Dialectal Arabic to Modern Standard Arabic},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Hend Al-Khalifa and Walid Magdy and Kareem Darwish and Tamer Elsayed},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-25-2},
  language  = {english}
}