LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | Building a Treebank for Italian: a Data-driven Annotation Schema |
Authors | Bosco Cristina (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149 Torino, Italy, bosco@di.unito.it); Lombardo Vincenzo (DISTA – Università del Piemonte Orientale “A. Avogadro”, c.so Borsalino 54, 15100 Alessandria, Italy, and Centro di Scienza Cognitiva – Università di Torino, via Lagrange 3, 10123 Torino, Italy, vincenzo@di.unito.it); Vassallo Daniela (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149 Torino, Italy, vassallo@di.unito.it); Lesmo Leonardo (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149 Torino, Italy, lesmo@di.unito.it) |
Keywords | Annotation Schema, Corpus, Dependency Format, Italian, Treebank |
Session | Session WO2 - Treebanks |
Full Paper | 220.ps, 220.pdf |
Abstract | Many natural language researchers are currently turning their attention to treebank development, seeking both accuracy and coverage of corpus data in their representation formats. This paper presents a data-driven annotation schema developed for an Italian treebank, designed to ensure data coverage and consistency in the annotation of linguistic phenomena. The schema is a dependency-based format centered on the notion of predicate-argument structure, augmented with traces to represent discontinuous constituents. The treebank is developed through an annotation process in which a human annotator is assisted by an interactive parsing tool that incrementally builds the syntactic representation of each sentence. To increase the syntactic knowledge of this parser, a specific data-driven strategy has been applied. We describe the cyclical development of the annotation schema, highlighting the richness and flexibility of the format, and we present some representational issues. |