The scarcity of parallel resources between Tibetan and other languages makes it difficult to apply current research in neural networks and deep learning. Constructing a large-scale parallel corpus between Tibetan and Chinese, English and other languages is therefore of great importance for Tibetan NLP in general. Moreover, machine translation between Tibetan and other languages poses many challenges compared with current mature machine translation systems, such as those between English and other languages. The availability of large-scale multilingual parallel language resources is essential for minority-language machine translation services to better serve the “Belt and Road” initiative. In this work, we study techniques for automatically acquiring Tibetan texts paired with texts in other languages at the chapter, paragraph, sentence and word levels, and we propose methods to acquire the knowledge needed for machine translation through both deep and broad knowledge mining. The first task is to study web-oriented automatic discrimination and extraction algorithms for acquiring comparable corpora and, at the same time, to expand the word-alignment and phrase-alignment libraries (block-aligned libraries) by maximizing local matching, in order to enrich Tibetan-related parallel language resources. The second is to study individual paragraph representations based on large-scale Chinese, Tibetan and English monolingual corpora, and to evaluate bilingual comparability in both the horizontal and vertical directions by comparing the similarity of these representations and optimizing the threshold. The third is to study methods for improving the alignment of language resources using monolingual and trilingual word representations as well as paragraph representations.
@InProceedings{JIACUO18.13,
  author    = {Cizhen Jiacuo and Sangjie Duanzhu},
  title     = {A Study on Machine Translation-oriented Parallel Corpus Construction Techniques for Tibetan, Chinese and English},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Erhong Yang and Le Sun},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-29-0},
  language  = {english}
}