Summary of the paper

Title Annotation and Classification of Toxicity for Thai Twitter
Authors Sugan Sirihattasak, Mamoru Komachi and Hiroshi Ishikawa
Abstract In this study, we present toxicity annotation for a Thai Twitter Corpus as a preliminary exploration for toxicity analysis in the Thai language. We construct a Thai toxic word dictionary and select 3,300 tweets for annotation using the 44 keywords from our dictionary. We obtained 2,027 toxic tweets and 1,273 non-toxic tweets labeled by three annotators. The result of corpus analysis indicates that tweets that include toxic words are not always toxic. Further, a tweet is more likely to be toxic if it contains toxic words used in their original meaning. Moreover, disagreements in annotation are primarily due to sarcasm, unclear existing target, and word sense ambiguity. Finally, we conducted supervised classification using our corpus as a dataset and obtained an accuracy of 0.80, which is comparable with the inter-annotator agreement of this dataset. Our dataset is available on GitHub.
Full paper Annotation and Classification of Toxicity for Thai Twitter
Bibtex @InProceedings{SIRIHATTASAK18.1,
  author = {Sugan Sirihattasak and Mamoru Komachi and Hiroshi Ishikawa},
  title = {Annotation and Classification of Toxicity for Thai Twitter},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Els Lefever and Bart Desmet and Guy De Pauw},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-27-6},
  language = {english}
  }