LREC 2000 2nd International Conference on Language Resources & Evaluation | |
Conference Papers and Abstracts
Experiences of Language Engineering Algorithm Reuse | Traditionally, the level of reusability of language processing resources within the research community has been very low. Most of the recycling of linguistic resources has been concerned with reuse of data, e.g., corpora, lexica, and grammars, while the algorithmic resources far too seldom have been shared between different projects and institutions. As a consequence, researchers who are willing to reuse somebody else's processing components have been forced to invest major efforts into issues of integration, inter-process communication, and interface design. In this paper, we discuss the experiences drawn from the svensk project regarding the issues of reusability of language engineering software, as well as some of the challenges for the research community which are prompted by them. Their main characteristics can be laid out along three dimensions: technical/software challenges, linguistic challenges, and `political' challenges. In the end, the unavoidable conclusion is that it definitely is time to bring more aspects of engineering into the Computational Linguistics community! | |
Derivation in the Czech National Corpus | The aim of this paper is to describe one of the main means of Czech word formation - derivation. New Czech words are created by composition or by derivation (by using prefixes or suffixes). The suffixes, which are added after the stem, are used much more frequently than the prefixes, which stand before the stem. The most frequent suffixes will be classified according to their paradigmatic and semantic properties and according to the changes they cause in the stem. The research is done on the Czech National Corpus (CNC); the frequencies of the investigated suffixes illustrate their productivity in present-day Czech. This research is of particular value for a highly inflected language such as Czech. Possible applications of this system include various NLP systems, e.g. spelling checkers and machine translation systems. The results of this work serve for the computational processing of Czech word formation and, in the future, for the creation of a Czech derivational dictionary. | |
Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers | This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample. | |
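The stacking idea behind COMBI-BOOTSTRAP can be pictured roughly as follows: the outputs of the existing taggers become features, and a second-level learner trained on a small annotated sample maps them to the new tagset. The memory-based lookup below is only a minimal stand-in for the paper's actual machine learning module, and all tag names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_combi(samples):
    """samples: ((tag_a, tag_b), new_tag) pairs from a small annotated
    sample; learn the most frequent new tag per combined-feature tuple."""
    counts = defaultdict(Counter)
    for feats, new_tag in samples:
        counts[feats][new_tag] += 1
    return {feats: c.most_common(1)[0][0] for feats, c in counts.items()}

def tag(model, feats, fallback="UNK"):
    """Map the combined outputs of the existing taggers to the new tagset."""
    return model.get(feats, fallback)

# Toy illustration: two existing taggers with incompatible tagsets.
sample = [
    (("NN", "Noun"), "N-COM"),
    (("NN", "Noun"), "N-COM"),
    (("NP", "Noun"), "N-PROP"),
    (("VB", "Verb"), "V-FIN"),
]
model = train_combi(sample)
print(tag(model, ("NP", "Noun")))  # N-PROP
```

A real second-level learner would also use the word form and context as features, which is what lets the combination beat any single tagger on a small sample.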
The Context (not only) for Humans | Our context considerations will be practically oriented; we will explore the specification of a context scope in the Czech morphological tagging. We mean by morphological tagging/annotation the automatic/manual disambiguation of the output of morphological analysis. The Prague Dependency Treebank (PDT) serves as a source of annotated data. The main aim is to concentrate on the evaluation of the influence of the chosen context on the tagging accuracy. | |
Something Borrowed, Something Blue: Rule-based Combination of POS Taggers | Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part of speech (POS) taggers can be combined to better cope with text material of a type on which they were not trained, and for which there are no readily available training corpora. We indicate—using freely available taggers for German (although the method we describe is not language-dependent)—how such taggers can be combined by using linguistically motivated rules so that the tagging accuracy of the combination exceeds that of the best of the individual taggers. | |
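Such a rule-based combination might be sketched, in simplified form, as hand-written override rules that are tried before a fall-back majority vote among the taggers. The rule below and the tagger names are hypothetical, not the ones actually used in the paper.

```python
from collections import Counter

def combine_tags(word, votes, rules=()):
    """votes maps tagger name -> proposed tag. Linguistically motivated
    override rules are tried first; otherwise take a majority vote."""
    for rule in rules:
        tag = rule(word, votes)
        if tag is not None:
            return tag
    return Counter(votes.values()).most_common(1)[0][0]

# Hypothetical rule: trust the tagger named "tnt" on capitalised words,
# since capitalisation is noun-marking in German.
def capital_rule(word, votes):
    return votes.get("tnt") if word[:1].isupper() else None

print(combine_tags("Haus", {"tnt": "NN", "tree": "NE", "brill": "NE"},
                   [capital_rule]))  # NN
```

The point of the rules is precisely to overrule the majority in cases where one tagger is known, on linguistic grounds, to be more reliable.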
Screffva: A Lexicographer's Workbench | This paper describes the implementation of Screffva, a computer system written in Prolog that employs a parallel corpus for the automatic generation of bilingual dictionary entries. Screffva provides a lemmatised interface between a parallel corpus and its bilingual dictionary. The system has been trialled with a parallel corpus of Cornish-English bitext. Screffva is able to retrieve any given segment of text, and uniquely identifies lexemes and the equivalences that exist between the lexical items in a bitext. Furthermore the system is able to cope with discontinuous multiword lexemes. The system is thus able to find glosses for individual lexical items or to produce longer lexical entries which include part-of-speech, glosses and example sentences from the corpus. The corpus is converted to a Prolog text database and lemmatised. Equivalents are then aligned. Finally Prolog predicates are defined for the retrieval of glosses, part-of-speech and example sentences to illustrate usage. Lexemes, including discontinuous multiword lexemes, are uniquely identified by the system and indexed to their respective segments of the corpus. Insofar as the system is able to identify specific translation equivalents in the bitext, the system provides a much more powerful research tool than existing concordancers such as ParaConc, WordSmith, XCorpus and Multiconcord. The system is able to automatically generate a bilingual dictionary which can be exported and used as the basis for a paper dictionary. Alternatively the system can be used directly as an electronic bilingual dictionary. | |
A Step toward Semantic Indexing of an Encyclopedic Corpus | This paper investigates a method for extracting and acquiring knowledge from linguistic resources. In particular, we propose an NLP-based architecture for building a semantic network out of an online XML-encoded encyclopedic corpus. The general application underlying this work is a question-answering system on proper nouns within an encyclopedia. | |
Issues in the Evaluation of Spoken Dialogue Systems - Experience from the ACCeSS Project | We describe the framework and present detailed results of an evaluation of 1,500 dialogues recorded during a three-month field trial of the ACCeSS Dialogue System. The system routed incoming calls to agents of a call center and handled about 100 calls per day. | |
Evaluating Summaries for Multiple Documents in an Interactive Environment | While most people have a clear idea of what a single document summary should look like, this is not immediately obvious for a multi-document summary. There are many new questions to answer concerning the number of documents to be summarized, the type of documents, the kind of summary that should be generated, the way the summary gets presented to the user, etc. The many possible approaches to multi-document summarization make evaluation especially difficult. In this paper we describe an approach to multi-document summarization and report work on an evaluation method for this particular system. | |
Grammarless Bracketing in an Aligned Bilingual Corpus | We propose a simple grammarless procedure to extract phrasal examples from aligned parallel texts. It is based on the differences in word order between the two languages. | |
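One way to picture such a grammarless procedure (a sketch only; the paper's actual algorithm may differ) is: given a word alignment, treat a source span as a candidate phrasal bracket whenever the target positions it aligns to are themselves contiguous.

```python
def contiguous_spans(alignment, src_len):
    """alignment: set of (src_idx, tgt_idx) word-alignment links.
    A source span is a candidate bracket when the target positions it
    aligns to form a contiguous block -- no grammar is consulted."""
    spans = []
    for a in range(src_len):
        for b in range(a, src_len):
            tgt = sorted({j for i, j in alignment if a <= i <= b})
            if tgt and tgt == list(range(tgt[0], tgt[-1] + 1)):
                spans.append((a, b))
    return spans

# Toy alignment: source words 0 and 1 swap order in the target language.
spans = contiguous_spans({(0, 1), (1, 0), (2, 2)}, 3)
print(spans)  # [(0, 0), (0, 1), (0, 2), (1, 1), (2, 2)]
```

The span (1, 2) is rejected because its aligned target positions {0, 2} are interrupted, which is exactly the word-order difference the method exploits.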
A Semi-automatic System for Conceptual Annotation, its Application to Resource Construction and Evaluation | The CONCERTO project, primarily concerned with the annotation of texts for their conceptual content, combines automatic linguistic analysis with manual annotation to ensure the accuracy of fact extraction, and to encode content in a rich knowledge representation framework. The system provides annotation tools, automatic multi-level linguistic analysis modules, a partial parsing formalism with a more user-friendly language than standard regular expression languages, XML-based document management, and a powerful knowledge representation and query facility. We describe the architecture and functionality of the system, how it can be adapted for a range of resource construction tasks, and how it can be configured to compute statistics on the accuracy of its automatic analysis components. | |
The MATE Workbench Annotation Tool, a Technical Description | The MATE workbench is a tool which aims to simplify the tasks of annotating, displaying and querying speech or text corpora. It is designed to help humans create language resources, and to make it easier for different groups to use one another’s data, by providing one tool which can be used with many different annotation schemes. Any annotation scheme which can be converted to XML can be used with the workbench, and display formats optimised for particular annotation tasks are created using a transformation language similar to XSLT. The workbench is written entirely in Java, which means that it is platform-independent. | |
Recruitment Techniques for Minority Language Speech Databases: Some Observations | This paper describes the collection efforts for SpeechDat Cymru, a 2000-speaker database for Welsh, a minority language spoken by about 500,000 of the Welsh population. The database is part of the SpeechDat(II) project. General database details are discussed insofar as they affect recruitment strategies, and likely differences between minority language spoken language resource (SLR) and general SLR collection are noted. Individual recruitment techniques are then detailed, with an indication of their relative successes and relevance to minority language SLR collection generally. It is observed that no one technique was sufficient to collect the entire database, and that those techniques involving face-to-face recruitment by an individual closely involved with the database collection produced the best yields for effort expended. More traditional postal recruitment techniques were less successful. The experiences during collection underlined the importance of utilising enthusiastic recruiters, and taking advantage of the speaker networks present in the community. | |
Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation | Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success. | |
PoS Disambiguation and Partial Parsing Bidirectional Interaction | This paper presents Latch; a system for PoS disambiguation and partial parsing that has been developed for Spanish. In this system, chunks can be recognized and can be referred to like ordinary words in the disambiguation process. This way, sentences are simplified so that the disambiguator can operate interpreting a chunk as a word and chunk head information as a word analysis. This interaction of PoS disambiguation and partial parsing reduces the effort needed for writing rules considerably. Furthermore, the methodology we propose improves both efficiency and results. | |
Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis | This paper presents a taxonomy of previous work on infrastructures, architectures and development environments for representing and processing Language Resources (LRs), corpora, and annotations. This classification is then used to derive a set of requirements for a Software Architecture for Language Engineering (SALE). The analysis shows that a SALE should address common problems and support typical activities in the development, deployment, and maintenance of LE software. The results will be used in the next phase of construction of an infrastructure for LR production, distribution, and access. | |
XCES: An XML-based Encoding Standard for Linguistic Corpora | The Corpus Encoding Standard (CES) is a part of the EAGLES Guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES) that provides a set of encoding standards for corpus-based work in natural language processing applications. We have instantiated the CES as an XML application called XCES, based on the same data architecture, comprising a primary encoded text and ''standoff'' annotation in separate documents. Conversion to XML enables use of some of the more powerful mechanisms provided in the XML framework, including the XSLT Transformation Language, XML Schemas, and support for inter-resource reference together with an extensive path syntax for pointers. In this paper, we describe the differences between the CES and XCES DTDs and demonstrate how XML mechanisms can be used to select from and manipulate annotated corpora encoded according to XCES specifications. We also provide a general overview of XML and the XML mechanisms that are most relevant to language engineering research and applications. | |
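Standoff annotation of the kind XCES builds on can be illustrated schematically: the primary text and its annotations live in separate documents, with each annotation pointing back into the text. The element names and character-offset pointers below are invented for illustration and are not the actual XCES DTD.

```python
import xml.etree.ElementTree as ET

# Primary document: the encoded text itself (element names invented).
primary = ET.fromstring("<text><p id='p1'>Time flies like an arrow.</p></text>")

# Standoff document: annotations refer to the text by id and offsets.
standoff = ET.fromstring(
    '<annotations>'
    '<ann target="p1" from="0" to="4" type="pos" value="NN"/>'
    '<ann target="p1" from="5" to="10" type="pos" value="VBZ"/>'
    '</annotations>')

def resolve(primary, ann):
    """Follow a standoff annotation back to the character span it covers."""
    para = primary.find(".//p[@id='%s']" % ann.get("target"))
    return para.text[int(ann.get("from")):int(ann.get("to"))]

tokens = [resolve(primary, a) for a in standoff.findall("ann")]
print(tokens)  # ['Time', 'flies']
```

Keeping annotation layers in separate documents means the primary text is never modified, and independently produced annotation sets can coexist over the same corpus.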
Named Entity Recognition in Greek Texts | In this paper, we describe work in progress for the development of a named entity recognizer for Greek. The system aims at information extraction applications where large scale text processing is needed. Speed of analysis, system robustness, and results accuracy have been the basic guidelines for the system’s design. Our system is an automated pipeline of linguistic components for Greek text processing based on pattern matching techniques. Non-recursive regular expressions have been implemented on top of it in order to capture different types of named entities. For development and testing purposes, we collected a corpus of financial texts from several web sources and manually annotated part of it. Overall precision and recall are 86% and 81% respectively. | |
A Robust Parser for Unrestricted Greek Text | In this paper we describe a method for the efficient parsing of real-life Greek texts at the surface syntactic level. A grammar consisting of non-recursive regular expressions describing Greek phrase structure has been compiled into a cascade of finite state transducers used to recognize syntactic constituents. The implemented parser lends itself to applications where large scale text processing is involved, and fast, robust, and relatively accurate syntactic analysis is necessary. The parser has been evaluated against a ca 34000 word corpus of financial and news texts and achieved promising precision and recall scores. | |
A Computational Platform for Development of Morphologic and Phonetic Lexica | Statistical approaches in speech technology, whether based on statistical language models, trees, hidden Markov models or neural networks, represent the driving forces for the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents the system architecture for the rapid construction of morphologic and phonetic lexica for the Slovenian language. The integrated graphical user interface focuses on the morphologic and phonetic aspects of Slovenian and allows experts to carry out their analyses efficiently. | |
An Open Architecture for the Construction and Administration of Corpora | The use of language corpora for a variety of purposes has increased significantly in recent years. General corpora are now available for many languages, but research often requires more specialized corpora. The rapid development of the World Wide Web has greatly improved access to data in electronic form, but research has tended to focus on corpus annotation, rather than on corpus building tools. Therefore many researchers are building their own corpora, solving problems independently, and producing project-specific systems which cannot easily be re-used. This paper proposes an open client-server architecture which can service the basic operations needed in the construction and administration of corpora, but allows customisation by users in order to carry out project-specific tasks. The paper is based partly on recent practical experience of building a corpus of 10 million words of Written Business English from webpages, in a project which was co-funded by ELRA and the University of Wolverhampton. | |
Design of Optimal Slovenian Speech Corpus for Use in the Concatenative Speech Synthesis System | This paper presents the development of a Slovenian speech corpus for use in the concatenative speech synthesis system being developed at the University of Maribor, Slovenia. The emphasis is on maximising the usefulness of the defined speech corpus for concatenation purposes. The usefulness of the speech corpus depends very much on the corresponding text, and can be increased if the appropriate text is chosen. In our approach, detailed statistics of the text corpora were computed in order to identify sentences rich in non-uniform units such as monophones, diphones and triphones. | |
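Selecting sentences to maximise coverage of units such as diphones is commonly done greedily. The following sketch is an assumption about such a procedure, not the authors' actual method: at each step it keeps the sentence contributing the most units not yet covered.

```python
def greedy_select(sentences, units_of, budget):
    """Greedy set cover: at each step keep the sentence that contributes
    the most units (e.g. diphones) not yet covered."""
    covered, chosen, pool = set(), [], list(sentences)
    while pool and len(chosen) < budget:
        best = max(pool, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
        pool.remove(best)
    return chosen, covered

def diphones(s):
    """Adjacent character pairs as a toy stand-in for phone-level diphones."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

chosen, covered = greedy_select(["abc", "bcd", "xy"], diphones, budget=2)
print(chosen, sorted(covered))  # ['abc', 'bcd'] ['ab', 'bc', 'cd']
```

A real pipeline would run `units_of` over a phonetic transcription of each candidate sentence rather than its orthography.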
CLinkA A Coreferential Links Annotator | The annotation of coreferential chains in a text is a difficult task, which requires a lot of concentration. Given its complexity, without an appropriate tool it is very difficult to produce high-quality coreferentially annotated corpora. In this paper we discuss the requirements for developing a tool for helping the human annotator in this task. The annotation scheme used by our program is derived from the one proposed by the MUC-7 Coreference Task Annotation, but is not restricted to it. Using a very simple language, the user is able to define his/her own annotation scheme. The tool has a user-friendly interface and is language- and platform-independent. | |
What's in a Thesaurus? | We first describe four varieties of thesaurus: (1) Roget-style, produced to help people find synonyms when they are writing; (2) WordNet and EuroWordNet; (3) thesauruses produced (manually) to support information retrieval systems; and (4) thesauruses produced automatically from corpora. We then contrast thesauruses and dictionaries, and present a small experiment in which we look at polysemy in relation to thesaurus structure. It has sometimes been assumed that different dictionary senses for a word that are close in meaning will be near neighbours in the thesaurus. This hypothesis is explored, using as inputs the hierarchical structure of WordNet 1.5 and a mapping between WordNet senses and the senses of another dictionary. The experiment shows that pairs of ‘lexicographically close’ meanings are frequently found in different parts of the hierarchy. | |
A Unified POS Tagging Architecture and its Application to Greek | This paper proposes a flexible and unified tagging architecture that could be incorporated into a number of applications like information extraction, cross-language information retrieval, term extraction, or summarization, while providing an essential component for subsequent syntactic processing or lexicographical work. A feature-based multi-tiered approach (FBT tagger) is introduced to part-of-speech tagging. FBT is a variant of the well-known transformation based learning paradigm aiming at improving the quality of tagging highly inflective languages such as Greek. Additionally, a large experiment concerning the Greek language is conducted and results are presented for a variety of text genres, including financial reports, newswires, press releases and technical manuals. Finally, the adopted evaluation methodology is discussed. | |
Resources for Lexicalized Tree Adjoining Grammars and XML Encoding: TagML | This work addresses both practical and theoretical purposes for the encoding and the exploitation of linguistic resources for feature-based Lexicalized Tree Adjoining Grammars (LTAG). The main goals of these specifications are the following: 1. Define a recommendation by way of an XML (Bray et al., 1998) DTD or schema (Fallside, 2000) for encoding LTAG resources in order to exchange grammars, share tools and compare parsers. 2. Exploit XML, its features and the related recommendations for the representation of complex and redundant linguistic structures based on a general methodology. 3. Study the resource organisation and the level of generalisation which are relevant for a lexicalized tree grammar. | |
Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers | We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types). | |
ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation | We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on “Annotation Graphs,” a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic “signals,” including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture. | |
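The annotation-graph model can be illustrated with a minimal data structure: nodes carry time offsets into the signal, and labeled arcs on named tiers span pairs of nodes. This is a toy sketch, not the ATLAS API; tier and method names are invented.

```python
class AnnotationGraph:
    """Minimal sketch of the annotation-graph idea: nodes anchor time
    offsets into a linear signal; labeled arcs span pairs of nodes."""
    def __init__(self):
        self.offsets = {}   # node id -> time offset into the signal
        self.arcs = []      # (start_node, end_node, tier, label)

    def node(self, nid, offset):
        self.offsets[nid] = offset

    def annotate(self, start, end, tier, label):
        self.arcs.append((start, end, tier, label))

    def query(self, tier):
        """All labels on one annotation tier, in start-offset order."""
        hits = [a for a in self.arcs if a[2] == tier]
        return [lab for s, e, t, lab in
                sorted(hits, key=lambda a: self.offsets[a[0]])]

g = AnnotationGraph()
g.node(0, 0.0); g.node(1, 0.42); g.node(2, 0.81)
g.annotate(0, 1, "word", "hello")
g.annotate(1, 2, "word", "world")
g.annotate(0, 2, "phrase", "greeting")
print(g.query("word"))  # ['hello', 'world']
```

Because arcs on different tiers share the same anchoring nodes, overlapping annotations (here a phrase spanning two words) coexist without any nesting constraint, which is what makes the model map well onto relational storage.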
Models of Russian Text/Speech Interactive Databases for Supporting of Scientific, Practical and Cultural Researches | The paper briefly describes the following databases: ”Online Sound Archives from St. Petersburg Collections”, ”Regional Variants of the Russian Speech”, and ”Multimedia Dictionaries of the minor Languages of Russia”, the principal feature of which is built-in support for scientific, practical and cultural research. Although these databases are addressed to researchers engaged mainly in Spoken Language Processing, so that their main object is sound, the proposed database ideology and general approach to text/speech data representation and access may also be used to develop various language resources containing text, audio and video data. This approach requires a special representation of the database material. Thus, all text and sound files should be accompanied by information on their multi-level segmentation, which should allow the user to extract and analyze any segment of text or speech. Each significant segment of the database should be perceived as a potential object of investigation and should be supplied with tables of descriptive parameters mirroring its various characteristics. The list of these parameters for all potential objects is open for further extension. | |
Some Technical Aspects about Aligning Near Languages | IULA at UPF has developed an aligner that benefits from corpus processing results to produce accurate and robust alignment, even with noisy parallel corpora. It compares the lemmata and part-of-speech tags of analysed texts, but it has two main characteristics: first, it apparently only works for near languages, and second, it requires morphological taggers for the compared languages. These two characteristics prevent the technique from being used for arbitrary pairs of languages. Whenever it is applicable, however, high-quality results are achieved. | |
Corpus Resources and Minority Language Engineering | Low density languages are typically viewed as those for which few language resources are available. Work relating to low density languages is becoming a focus of increasing attention within language engineering (e.g. Charoenporn, 1997, Hall and Hudson, 1997, Somers, 1997, Nirenberg and Raskin, 1998, Somers, 1998). However, much work related to low density languages is still in its infancy, or worse, work is blocked because the resources needed by language engineers are not available. In response to this situation, the MILLE (Minority Language Engineering) project was established by the Engineering and Physical Sciences Research Council (EPSRC) in the UK to discover what language corpora should be built to enable language engineering work on non-indigenous minority languages in the UK, most of which are typically low-density languages. This paper summarises some of the major findings of the MILLE project. | |
CDB - A Database of Lexical Collocations | CDB is a relational database designed for the particular needs of representing lexical collocations. The relational model is defined such that competence-based descriptions of collocations (the competence base) and actually occurring collocation examples extracted from text corpora (the example base) complete each other. In the paper, the relational model is described and examples for the representation of German PP-verb collocations are given. A number of example queries are presented, and additional facilities which are built on top of the database are discussed. | |
Evaluation for Darpa Communicator Spoken Dialogue Systems | The overall objective of the DARPA COMMUNICATOR project is to support rapid, cost-effective development of multi-modal speech-enabled dialogue systems with advanced conversational capabilities, such as plan optimization, explanation and negotiation. In order to make this a reality, we need to find methods for evaluating the contribution of various techniques to the users’ willingness and ability to use the system. This paper reports on the approach to spoken dialogue system evaluation that we are applying in the COMMUNICATOR program. We describe our overall approach, the experimental design, the logfile standard, and the metrics applied in the experimental evaluation planned for June of 2000. | |
Transcribing with Annotation Graphs | Transcriber is a tool for manual annotation of large speech files. It was originally designed for the broadcast news transcription task. The annotation file format was derived from previous formats used for this task, and many related features were hard-coded. In this paper we present a generalization of the tool based on the annotation graph formalism, and on a more modular design. This will allow us to address new tasks, while retaining Transcriber’s simple, crisp user-interface which is critical for user acceptance. | |
Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results | We are annotating a corpus with information relevant to discourse entity realization, and especially the information needed to decide which type of NP to use. The corpus is being used to study correlations between NP type and certain semantic or discourse features, to evaluate hand-coded algorithms, and to train statistical models. We report on the development of our annotation scheme, the problems we have encountered, and the results obtained so far. | |
Towards a Query Language for Annotation Graphs | The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation. | |
The American National Corpus: A Standardized Resource for American English | At the first conference on Language Resources and Evaluation, Granada 1998, Charles Fillmore, Nancy Ide, Daniel Jurafsky, and Catherine Macleod proposed creating an American National Corpus (ANC) that would compare with the British National Corpus (BNC) both in balance and in size (one hundred million words). This paper reports on the progress made over the past two years in launching the project. At present, the ANC project is well underway, with commitments for support and contribution of texts from a number of publishers world-wide. | |
Semantic Tagging for the Penn Treebank | This paper describes the methodology that is being used to augment the Penn Treebank annotation with sense tags and other types of semantic information. Inspired by the results of SENSEVAL, and the high inter-annotator agreement that was achieved there, similar methods were used for a pilot study of 5000 words of running text from the Penn Treebank. Using the same techniques of allowing the annotators to discuss difficult tagging cases and to revise WordNet entries if necessary, comparable inter-annotator rates have been achieved. The criteria for determining appropriate revisions and ensuring clear sense distinctions are described. We are also using hand correction of automatic predicate argument structure information to provide additional thematic role labeling. | |
Rule-based Tagging: Morphological Tagset versus Tagset of Analytical Functions | This work presents part of a more global study on the problem of parsing Czech and on the knowledge extraction capabilities of the rule-based method. It is shown that the success of the rule-based method for English, and its lack of success for Czech, is due not only to the small cardinality of the English tagset (as is usually claimed) but mainly to the tagset's structure (the ”regularity” of the language information). | |
The (Un)Deterministic Nature of Morphological Context | The aim of this paper is to contribute to the study of context within natural language processing and to bring in aspects which, I believe, have a direct influence on the interpretation of success rates and on a more successful design of language models. This work tries to formalize the (ir)regularities and dynamic characteristics of context using techniques from the field of chaotic and non-linear systems. The observations are made on the problem of POS tagging.