Title |
NLGbAse: A Free Linguistic Resource for Natural Language Processing Systems |
Authors |
Eric Charton and Juan-Manuel Torres-Moreno |
Abstract |
The availability of labeled language resources, such as annotated corpora and domain-dependent labeled language resources, is crucial for experiments in the field of Natural Language Processing. Most often, due to a lack of resources, manual verification and annotation of electronic text material is a prerequisite for the development of NLP tools. In the context of under-resourced languages, the lack of corpora becomes a crucial problem because most research efforts are supported by organizations with limited funds. Using free, multilingual and highly structured corpora like Wikipedia to automatically produce labeled language resources can be an answer to those needs. This paper introduces NLGbAse, a multilingual linguistic resource built from Wikipedia's encyclopedic content. The system produces structured metadata that make possible the automatic annotation of corpora with syntactic and semantic labels. Each metadata record contains semantic and statistical information related to an encyclopedic document. To validate our approach, we built and evaluated a Named Entity Recognition tool trained on Wikipedia corpora annotated by our system. |
Topics |
Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval, Named Entity recognition |
Full paper |
NLGbAse: A Free Linguistic Resource for Natural Language Processing Systems |
Slides |
- |
Bibtex |
@InProceedings{CHARTON10.900,
  author    = {Eric Charton and Juan-Manuel Torres-Moreno},
  title     = {NLGbAse: A Free Linguistic Resource for Natural Language Processing Systems},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |