Title |
An API for Discourse-level Access to XML-encoded Corpora |
Authors |
Christoph Müller (European Media Laboratory GmbH, Villa Bosch, Schloß-Wolfsbrunnenweg 33, 69118 Heidelberg, Germany); Michael Strube (European Media Laboratory GmbH, Villa Bosch, Schloß-Wolfsbrunnenweg 33, 69118 Heidelberg, Germany) |
Session |
MMO1: Tools & Annotations |
Abstract |
We describe a simple and efficient Java object model and application programming interface (API) for annotated, possibly multi-modal, natural language corpora. Corpora are represented in terms of elements such as Sentences, Turns, Utterances, Words, Gestures and Markables. The API lets linguists access corpora through these discourse-level elements, i.e., at a conceptual level they are familiar with, while retaining the flexibility of a general-purpose programming language. It also contributes to corpus standardization efforts, because it is based on a straightforward and easily extensible data model that can serve as a target for the conversion of different corpus formats. |
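To make the idea of discourse-level access concrete, the following is a minimal Java sketch of what such an object model might look like. It is not the API described in the paper: the class names `Word`, `Markable`, `Sentence` and `DiscourseSketch`, all of their methods, and the example sentence are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: every class and method name below is an
// illustrative assumption, not the API described in the paper.

/** A single token of the base data. */
final class Word {
    private final String id;
    private final String text;

    Word(String id, String text) {
        this.id = id;
        this.text = text;
    }

    String getId() { return id; }
    String getText() { return text; }
}

/** An annotated span of words with a type label such as "np". */
final class Markable {
    private final String type;
    private final List<Word> words;

    Markable(String type, List<Word> words) {
        this.type = type;
        // Copy the span so later changes to the sentence do not affect it.
        this.words = Collections.unmodifiableList(new ArrayList<>(words));
    }

    String getType() { return type; }
    List<Word> getWords() { return words; }

    /** Joins the covered words with single spaces. */
    String getText() {
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(w.getText());
        }
        return sb.toString();
    }
}

/** A sentence holding its words and the markables defined over them. */
final class Sentence {
    private final List<Word> words = new ArrayList<>();
    private final List<Markable> markables = new ArrayList<>();

    void addWord(Word w) { words.add(w); }

    /** Adds a markable over the words in [from, to), by word index. */
    Markable addMarkable(String type, int from, int to) {
        Markable m = new Markable(type, words.subList(from, to));
        markables.add(m);
        return m;
    }

    /** Returns all markables of the given type, in insertion order. */
    List<Markable> getMarkables(String type) {
        List<Markable> result = new ArrayList<>();
        for (Markable m : markables) {
            if (m.getType().equals(type)) result.add(m);
        }
        return result;
    }
}

class DiscourseSketch {
    /** Builds a toy sentence with two noun-phrase markables. */
    static Sentence example() {
        Sentence s = new Sentence();
        String[] tokens = {"the", "cat", "saw", "a", "dog"};
        for (int i = 0; i < tokens.length; i++) {
            s.addWord(new Word("word_" + (i + 1), tokens[i]));
        }
        s.addMarkable("np", 0, 2); // "the cat"
        s.addMarkable("np", 3, 5); // "a dog"
        return s;
    }

    public static void main(String[] args) {
        // Access the corpus in terms of discourse-level elements
        // rather than raw XML nodes.
        for (Markable m : example().getMarkables("np")) {
            System.out.println(m.getText());
        }
    }
}
```

In this sketch, client code asks a `Sentence` for its markables of a given type instead of navigating XML elements directly, which is the kind of conceptual-level access the abstract refers to. How the objects would be populated from XML-encoded corpus files is left out here.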
Keywords |
Corpus exploitation, Standardization, Discourse processing, XML, Reusability |
Full Paper |