LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Building a Treebank for Italian: a Data-driven Annotation Schema
Authors Bosco Cristina (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149, Torino (Italy), bosco@di.unito.it)
Lombardo Vincenzo (DISTA – Università del Piemonte Orientale “A. Avogadro”, c.so Borsalino 54, 15100 Alessandria, Italy, Centro di Scienza Cognitiva – Università di Torino, via Lagrange 3, 10123 Torino, Italy, vincenzo@di.unito.it)
Vassallo Daniela (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149, Torino (Italy), vassallo@di.unito.it)
Lesmo Leonardo (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149, Torino (Italy), lesmo@di.unito.it)
Keywords Annotation Schema, Corpus, Dependency Format, Italian, Treebank
Session Session WO2 - Treebanks
Full Paper 220.ps, 220.pdf
Abstract Many natural language researchers are currently turning their attention to treebank development, aiming for accuracy and corpus data coverage in their representation formats. This paper presents a data-driven annotation schema developed for an Italian treebank, designed to ensure data coverage and consistency in the annotation of linguistic phenomena. The schema is a dependency-based format centered on the notion of predicate-argument structure, augmented with traces to represent discontinuous constituents. The treebank is developed through an annotation process in which a human annotator is assisted by an interactive parsing tool that incrementally builds a syntactic representation of the sentence. To increase the syntactic knowledge of this parser, a specific data-driven strategy has been applied. We describe the cyclical development of the annotation schema, highlighting the richness and flexibility of the format, and we present some representational issues.
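
As a rough illustration of what a dependency format augmented with traces might look like, the sketch below encodes a toy sentence in which a fronted object is separated from the verb that selects it. It is not the schema defined in the paper: the `Node` class, the `arguments_of` helper, the relation labels (`ROOT`, `SUBJ`, `OBJ`, `EXTRACTED`) and the analysis of the example sentence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a dependency tree (hypothetical layout, not the paper's)."""
    index: int                      # 1-based position in the annotation
    form: str                       # word form; empty string for a trace
    head: int                       # index of the governing node, 0 for the root
    relation: str                   # illustrative dependency label
    trace_of: Optional[int] = None  # index of the overt word a trace stands for


def arguments_of(nodes: List[Node], head_index: int) -> List[Tuple[str, str]]:
    """List (relation, form) pairs for the dependents of one head.

    A trace is resolved to the overt word it is co-indexed with, so the
    predicate-argument structure can be read off even when the argument
    is realised elsewhere in the sentence.
    """
    by_index = {n.index: n for n in nodes}
    result = []
    for n in nodes:
        if n.head != head_index:
            continue
        target = by_index[n.trace_of] if n.trace_of is not None else n
        result.append((n.relation, target.form))
    return result


# Toy sentence: "Cosa vuole comprare Maria?" ("What does Maria want to buy?")
# The fronted "Cosa" is the object of "comprare"; node 5 is a trace that
# marks that argument position. Labels and attachments are invented.
sentence = [
    Node(1, "Cosa", 2, "EXTRACTED"),
    Node(2, "vuole", 0, "ROOT"),
    Node(3, "comprare", 2, "OBJ"),
    Node(4, "Maria", 2, "SUBJ"),
    Node(5, "", 3, "OBJ", trace_of=1),
]

print(arguments_of(sentence, 3))
```

Keeping the trace as an ordinary node with a co-index, as assumed here, lets every argument attach locally to its predicate while the overt word stays in its surface position.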