LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title A Web-based Text Corpora Development System
Authors Bohuş Dan (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, bd1206@cs.utt.ro)
Boldea Marian (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, boldea@cs.utt.ro)
Keywords Diacritic Characters Restoration, HTML-to-Text Conversion, Morpho-Syntactic Annotation, Part-of-Speech Tagging, Text Corpora
Session Session WP7 - Corpus Projects
Full Paper 105.ps, 105.pdf
Abstract One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that addresses both the size and the quality of these corpora. The size problem is addressed by using the Internet as a practically limitless source of texts. To ensure quality, the system enriches the text with information relevant to further use by treating the problems of morpho-syntactic annotation, lexical ambiguity resolution, and diacritic character restoration in an integrated manner. Although the system currently targets texts in Romanian, it can be adapted to other languages, provided that appropriate auxiliary resources are available.