Title |
A Web-based Text Corpora Development System |
Authors |
Bohuş Dan (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, bd1206@cs.utt.ro); Boldea Marian (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, boldea@cs.utt.ro) |
Keywords |
Diacritic Characters Restoration, HTML-to-Text Conversion, Morpho-Syntactic Annotation, Part-of-Speech Tagging, Text Corpora |
Session |
Session WP7 - Corpus Projects |
Full Paper |
105.ps, 105.pdf |
Abstract |
One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that addresses both the size and the quality of these corpora. The size problem is solved by using the Internet as a practically limitless source of texts. To ensure quality, the system enriches the text with information relevant for further use, treating morpho-syntactic annotation, lexical ambiguity resolution, and diacritic character restoration in an integrated manner. Although it currently targets texts in Romanian, the system can be adapted to other languages, provided that appropriate auxiliary resources are available. |