Title |
An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora |
Authors |
K Saravanan, Monojit Choudhury, Raghavendra Udupa and A Kumaran |
Abstract |
Named Entities (NEs) that occur in natural language text are important, especially due to the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora, providing valuable insight for the understanding of the corpus and subsequently paving the way for the development of technologies that rely critically on handling NEs. We use two distinct approaches: normal statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-network-based analysis that measures the co-occurrence pattern in terms of connectivity, degree distribution, small-world phenomenon, etc. Our analysis indicates that: (i) NEs form an open set in corpora and grow linearly, (ii) there is a kernel and a periphery of NEs, with the large periphery occurring rarely, and (iii) there is strong evidence of the small-world phenomenon. Our findings may suggest effective ways to construct NE lexicons to aid efficient development of several natural language technologies. |
Topics |
Named Entity recognition, Corpus (creation, annotation, etc.), Lexicon, lexical database |
Full paper |
An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora |
Bibtex |
@InProceedings{SARAVANAN12.305,
  author = {K Saravanan and Monojit Choudhury and Raghavendra Udupa and A Kumaran},
  title = {An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |