With the development of the Internet, information extraction has become an important research direction in natural language processing. Named entity recognition is one of its most active topics and an important step in practical natural language processing. Texts contain a large number of entities, such as person names, location names, and organization names, which are in some sense products of their time. As society and information technology develop, new named entities constantly appear and some existing ones fall out of use, so constructing a dictionary that contains all named entities is impractical. Automatic recognition of named entities is therefore an important task; otherwise, entities that go unrecognized during text processing become unknown words, which degrades the performance of machine translation, knowledge graph construction, question answering, syntactic parsing, and other applications. Large annotated named entity corpora currently exist for languages such as English and Chinese, and named entity recognition technology for these languages is relatively mature. For low-resource languages such as Uyghur, however, no publicly available named entity corpus has yet appeared. This limits research on Uyghur named entity recognition and also holds back the development of Uyghur information extraction technology. This article therefore first collects a large bilingual corpus in the news domain and then explores how to quickly build a named entity corpus for a low-resource language by using cross-language named entity recognition technology.
Specifically, named entities are first labeled automatically in the resource-rich language. Next, sentence pairs that contain named entities are selected, and the sentences in the resource-scarce language are pre-labeled using bilingual named entity dictionaries. Finally, the pre-labeled annotations are corrected and supplemented manually, with annotation memory technology used to further improve the efficiency and quality of annotation. This corpus construction process speeds up annotation and ensures its consistency while reducing the cost of manual annotation. In this way, Uyghur annotation corpora for location names, organization names, and person names, together with a Uyghur named entity annotation corpus, are constructed.
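The selection and pre-labeling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact token matching, the BIO tag scheme, and the placeholder target-language tokens are all assumptions made for illustration. The abstract only states that bilingual named entity dictionaries are used to pre-label the resource-scarce side.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (source-language surface form, entity type)


def select_pairs(pairs: List[Tuple[List[Entity], List[str]]]):
    """Keep only sentence pairs whose source side has at least one entity."""
    return [(ents, tgt) for ents, tgt in pairs if ents]


def find_span(tokens: List[str], phrase: List[str]) -> int:
    """Return the start index of the first exact match of phrase, or -1."""
    n = len(phrase)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def prelabel(target_tokens: List[str],
             source_entities: List[Entity],
             bilingual_dict: Dict[str, str]) -> List[str]:
    """Project source-side entities onto the target sentence as BIO tags.

    Assumption: the dictionary maps a source surface form to a
    whitespace-tokenized target surface form, matched exactly.
    """
    tags = ["O"] * len(target_tokens)
    for surface, etype in source_entities:
        translation = bilingual_dict.get(surface)
        if translation is None:
            continue  # not in the dictionary: left for manual annotation
        phrase = translation.split()
        start = find_span(target_tokens, phrase)
        if start < 0:
            continue  # translation not found in the target sentence
        end = start + len(phrase)
        if any(tag != "O" for tag in tags[start:end]):
            continue  # do not overwrite an earlier label
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags


# Placeholder tokens stand in for real target-language words.
example_dict = {"Beijing": "tgt_beijing",
                "United Nations": "tgt_united tgt_nations"}
example_target = ["w1", "tgt_beijing", "w2", "tgt_united", "tgt_nations"]
example_entities = [("Beijing", "LOC"), ("United Nations", "ORG")]
print(prelabel(example_target, example_entities, example_dict))
# prints ['O', 'B-LOC', 'O', 'B-ORG', 'I-ORG']
```

Entities that are missing from the dictionary or not found in the target sentence stay unlabeled in this sketch, which corresponds to the cases left for the manual correction and supplementation step.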
@InProceedings{MAIMAITI18.14,
  author    = {Maihemuti Maimaiti and Aishan Wumaier},
  title     = {Construction of Uyghur named entity corpus},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Erhong Yang and Le Sun},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-29-0},
  language  = {english}
}