Title |
The Lácio-Web: Corpora and Tools to advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools |
Author(s) |
Sandra Aluisio (1), Gisele Montilha Pinheiro (1), Aline M. P. Manfrin (1), Leandro H. M. de Oliveira (1), Luiz C. Genoves Jr. (1), Stella E. O. Tagnin (2); (1) NILC/ICMC-USP: Núcleo Interinstitucional de Lingüística Computacional (NILC), ICMC-University of São Paulo, CP 668, 13560-970 São Carlos, SP, Brazil; (2) FFLCH-USP: FFLCH – DLM, University of São Paulo, Av. Prof. Luciano Gualberto, 403, 05508-900 São Paulo, SP, Brazil |
Session |
P19-SW |
Abstract |
In this paper, we discuss five requirements for building large, publicly available corpora that guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation, and an annotation scheme; 3) a friendly and didactic interface; 4) the need to support several types of research; 5) the need to offer an array of associated tools. We also present the features that make the Lácio-Web corpora interesting and novel, as well as the limitations of the project, such as corpora size and balance and the non-inclusion of spoken texts in the project’s reference corpus. |
Keyword(s) |
Written corpora, Brazilian Portuguese, POS annotated corpus, interface issues, text typology, corpora associated tools |
Language(s) |
Portuguese |