Title |
The CLaRK System: XML-based Corpora Development System for Rapid Prototyping |
Author(s) |
Kiril Simov, Alexander Simov, Hristo Ganev, Krasimira Ivanova, Ilko Grigorov. BulTreeBank Project (http://www.BulTreeBank.org), Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Acad. G. Bonchev St. 25A, 1113 Sofia, Bulgaria. kivs@bultreebank.org, alex@bultreebank.org, ico@bultreebank.org, krassy_v@bultreebank.org, ilko@bultreebank.or |
Session |
P1-W |
Abstract |
The paper presents the CLaRK System as a tool for the creation of XML-based corpora and as a platform for rapid prototyping. The system provides a set of basic tools for processing XML documents: tokenizers, regular grammars, and constraints, together with remove, insert, extract, sort, and transformation operations. Additionally, the system is equipped with a macro language which allows the creation of tool sequences. The macro language includes a set of control operators that guide the application of the tools in a macro. Usually, a tool or a macro works over a single document, either changing it or producing a new document. In some cases, more than one document must be processed, for example in iterative statistics for treebank transformation or in stand-off annotation. For such processing, the macro language allows the documents being processed to be changed dynamically. |
Keyword(s) |
XML corpora, corpora creation, prototyping |
Language(s) | Bulgarian |
Full Paper |