Summary of the paper

Title CTTC: A Collection of Tibetan Text Corpora
Authors Huidan Liu and Long Congjun
Abstract The Chinese Academy of Sciences launched the Multi-Layer MultiLingual Resource Database (MLLRD) project, which aims to collect language resources for natural language processing tasks for low-resource languages used in China, such as Mongolian, Tibetan, and Uyghur. Tibetan text corpus building is one of the subprojects, in which we have built a Collection of Tibetan Text Corpora (CTTC), including: (1) a Tibetan web article corpus with 440,900 documents; (2) a Tibetan text classification corpus; (3) a Chinese-Tibetan parallel text corpus with 773,068 sentence pairs; (4) a part-of-speech tagged corpus with 52,041 sentences; (5) a Tibetan tree bank with 6,040 trees. The paper reports the methods used to build these corpora, the contents and scale of each corpus, and their applications.
Topics Tree Bank, Machine Translation
Full paper CTTC: A Collection of Tibetan Text Corpora
Bibtex @InProceedings{LIU18.16,
  author = {Huidan Liu and Long Congjun},
  title = {CTTC: A Collection of Tibetan Text Corpora},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Erhong Yang and Le Sun},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-29-0},
  language = {english}
  }