LREC WORKSHOP
Data Architectures and Software Support for Large Corpora
Towards an American National Corpus
Athens, Greece
30 May 2000
This workshop has been merged with the EAGLES/ISLE Workshop on
Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language
Resources. A full program and description of the workshops and information
for authors can be found HERE.
Description
Several software systems for linguistic annotation, search, and retrieval
of large corpora have been developed within the natural language processing
community over the past several years, including LT-XML (Edinburgh), GATE
(Sheffield), IMS Corpus Workbench (Stuttgart), Alembic Workbench (MITRE),
MATE (Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA (BNC), and
several others. In support of this development, there have also been efforts
to develop standards for encoding and for various kinds of linguistic
annotation, as well as data architectures (e.g., TIPSTER, TalkBank). Still
other developments, such as the introduction of XML and the
powerful XSL transformation language and work on semi-structured data (e.g.,
the work of the Lore group at Stanford), have also impacted the ways in
which corpora and other linguistic resources can be represented, stored,
and accessed.
Current systems for the annotation and exploitation of linguistic corpora
vary in the fundamental design of their formats, data, and tools. A
primary reason for this diversity is that most developers
are concerned with only one aspect of the creation/annotation/exploitation
process. However, in order to work effectively toward commonality, the
phases of the process must be considered as a whole. This demands bringing
together researchers and developers from a variety of domains in text,
speech, video, etc., many of whom have previously had little or no contact.
This workshop is intended to bring these groups together to look broadly
at the technical issues that bear on the development of software systems
for the annotation and exploitation of linguistic resources. The goal is
to lay the groundwork for the definition of a data and system architecture
to support corpus annotation and exploitation that can be widely adopted
within the community. Among the issues to be addressed are:
- layered data architectures
- system architectures for distributed databases
- support for plurality of annotation schemes
- impact and use of XML/XSL
- support for multimedia, including speech and video
- tools for creation, annotation, query and access of corpora
- mechanisms for linkage of annotation and primary data
- applicability of semi-structured data models, search and query systems, etc.
- evaluation/validation of systems and annotations
The motivation for this workshop is the American National Corpus (ANC)
effort, which should begin corpus creation within the year. We anticipate
that the ANC will provide a significant resource for natural language processing,
and we therefore seek to identify state-of-the-art methods for its creation,
annotation, and exploitation. Also, because the ANC will be a national and
freely available resource, its data and system architecture is likely to
become a de facto standard. We therefore hope to draw together leading researchers
and developers to establish a basis for the design of a system to support
the creation and use of the ANC.
A "Birds of a Feather" session for those interested in the ANC project
will be held immediately following the workshop.
Contact
Nancy Ide
Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
Tel: +1 914 437 5988
Fax: +1 914 437 7498
ide@vassar.edu