LREC 2008 Proceedings

Summary of the paper

Title	A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Authors	Keiji Shinzato, Daisuke Kawahara, Chikara Hashimoto and Sadao Kurohashi
Abstract	In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources based on the web page unit are useful in NLP tasks and applications such as knowledge acquisition, document retrieval and document summarization, no such resources have been released so far. In this paper, we propose a data format for the results of web page processing, and a search engine infrastructure which makes it possible to share approximately 100 million Japanese web pages. By obtaining the web data, NLP researchers can begin their own processing immediately without analyzing web pages themselves.
Language	Language-independent
Topics	LR Infrastructures and Architectures, LR web services, Standards for LRs
Full paper	A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Slides	-
Bibtex	@InProceedings{SHINZATO08.564, author = {Keiji Shinzato and Daisuke Kawahara and Chikara Hashimoto and Sadao Kurohashi}, title = {A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure}, booktitle = {Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)}, year = {2008}, month = {may}, date = {28-30}, address = {Marrakech, Morocco}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-4-0}, note = {http://www.lrec-conf.org/proceedings/lrec2008/}, language = {english} }