Summary of the paper

Title MirasText: An Automatically Generated Text Corpus for Persian
Authors Behnam Sabeti, Hossein Abedi Firouzjaee, Ali Janalizadeh Choobbasti, Seyed hani elamahdi Mortazavi Najafabadi and Amir Vaheb
Abstract Natural Language Processing is one of the most important fields of artificial intelligence. The rapid growth of digital content has made this field both practical and challenging at the same time. Unlike less-resourced languages such as Persian, dominant languages such as English have several text corpora that can be used for NLP applications. In this paper, MirasText, an automatically generated text corpus for the Persian language, is presented. In this study, over 250 Persian websites were crawled and several fields such as content, description, keywords, title, etc. were extracted to generate MirasText. Topic modeling and language modeling are used to validate the generated corpus. MirasText has over 2.8 million documents and over 1.4 billion tokens, which to our knowledge makes it the largest Persian corpus currently available.
Topics Language Modelling, Corpus (Creation, Annotation, etc.), LR Infrastructures and Architectures
Full paper MirasText: An Automatically Generated Text Corpus for Persian
Bibtex @InProceedings{SABETI18.385,
  author = {Behnam Sabeti and Hossein Abedi Firouzjaee and Ali Janalizadeh Choobbasti and Seyed hani elamahdi Mortazavi Najafabadi and Amir Vaheb},
  title = "{MirasText: An Automatically Generated Text Corpus for Persian}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }
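The abstract mentions extracting fields such as content, description, keywords, and title from crawled pages. The paper summary does not say how this was done, so the following is only a minimal sketch of one possible approach, using Python's standard `html.parser`. The class name `PageFieldExtractor`, the function `extract_fields`, and the choice of `<p>` elements as the source of "content" are all assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser


class PageFieldExtractor(HTMLParser):
    """Hypothetical extractor: collects title, meta description,
    meta keywords, and the text of <p> elements from one page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": [], "content": ""}
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            value = (attrs.get("content") or "").strip()
            if name == "description":
                self.fields["description"] = value
            elif name == "keywords":
                # Assumption: keywords are separated by a Latin comma
                # or by the Arabic comma (U+060C) common in Persian text.
                value = value.replace("\u060c", ",")
                self.fields["keywords"] = [k.strip() for k in value.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data
        elif self._in_paragraph:
            # Text inside nested inline tags (e.g. <b>) is appended too.
            self.paragraphs[-1] += data


def extract_fields(html):
    """Return a dict with title, description, keywords, and content."""
    parser = PageFieldExtractor()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["title"] = fields["title"].strip()
    fields["content"] = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return fields
```

Calling `extract_fields(page_html)` on the HTML of a single crawled page returns one record. A real pipeline would also need fetching, encoding detection, boilerplate removal, and deduplication, none of which are shown here.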