LREC 2000 2nd International Conference on Language Resources & Evaluation | |
Conference Papers and Abstracts
Cairo: An Alignment Visualization Tool | While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system. We developed Cairo to fill this need. Cairo is a free, open-source, portable, user-friendly, GUI-driven program written in Java that provides a visual representation of word correspondences between bilingual pairs of sentences, as well as relevant translation model parameters. This program can be easily adapted for visualization of correspondences in bi-texts based on probability distributions. | |
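The following is an illustrative sketch only: Cairo itself is a Java GUI application and is not reproduced here. The sketch renders the same kind of information, word correspondences between a bilingual sentence pair weighted by translation probability, using Python and matplotlib; the sentence pair and the probabilities are invented for the example.

```python
# Illustrative sketch: draw word correspondences between a sentence pair,
# with line weight and opacity reflecting the model's link probability.
import matplotlib.pyplot as plt

def plot_alignment(src_words, tgt_words, links):
    """Source words on top, target words below, one line per
    (src_index, tgt_index, probability) link."""
    fig, ax = plt.subplots(figsize=(8, 3))
    for i, w in enumerate(src_words):
        ax.text(i, 1.0, w, ha="center", va="bottom")
    for j, w in enumerate(tgt_words):
        ax.text(j, 0.0, w, ha="center", va="top")
    for i, j, p in links:
        ax.plot([i, j], [0.95, 0.05], linewidth=3 * p, color="black", alpha=p)
    ax.set_ylim(-0.3, 1.3)
    ax.axis("off")
    plt.show()

# Invented example pair and probabilities, purely for illustration.
plot_alignment(
    ["the", "house", "is", "small"],
    ["das", "Haus", "ist", "klein"],
    [(0, 0, 0.9), (1, 1, 0.95), (2, 2, 0.8), (3, 3, 0.85)],
)
```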
Cardinal, Nominal or Ordinal Similarity Measures in Comparative Evaluation of Information Retrieval Process | Similarity measures are used to quantify the resemblance of two sets. The simplest ones are computed from ratios of the numbers of documents in the compared sets; these simple measures are usually employed in the first steps of evaluation studies and are called cardinal measures. Other measures compare sets according to the number of documents they have in common. They are usually employed in quantitative information retrieval evaluations; examples include Jaccard, Cosine, Recall and Precision. These measures are called nominal ones. They are more or less well adapted depending on the richness of the information system's answer. In the past they were sufficient, because the answers given by systems consisted only of an unordered set of documents. But current systems improve the quality or the visibility of their answers by ranking documents by relevance or by presenting them in clusters, and in this case such similarity measures are no longer adequate. In this paper we present some solutions for the cases of totally ordered and partially ordered answers. | |
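As a point of reference, here is a minimal sketch of the nominal (set-based) measures named in the abstract, computed between a system's answer set and a set of relevant documents; the document identifiers are invented. A cardinal measure would simply compare the sizes of the two sets.

```python
# Nominal (set-based) similarity measures over an answer set and a relevant set.
def jaccard(answer, relevant):
    return len(answer & relevant) / len(answer | relevant)

def cosine(answer, relevant):
    # Cosine between the binary incidence vectors of the two sets.
    return len(answer & relevant) / (len(answer) ** 0.5 * len(relevant) ** 0.5)

def precision(answer, relevant):
    return len(answer & relevant) / len(answer)

def recall(answer, relevant):
    return len(answer & relevant) / len(relevant)

# Invented document identifiers, for illustration only.
answer = {"d1", "d2", "d3", "d5"}
relevant = {"d2", "d3", "d4"}
print(jaccard(answer, relevant), cosine(answer, relevant),
      precision(answer, relevant), recall(answer, relevant))
```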
CDB - A Database of Lexical Collocations | CDB is a relational database designed for the particular needs of representing lexical collocations. The relational model is defined such that competence-based descriptions of collocations (the competence base) and actually occurring collocation examples extracted from text corpora (the example base) complete each other. In the paper, the relational model is described and examples for the representation of German PP-verb collocations are given. A number of example queries are presented, and additional facilities which are built on top of the database are discussed. | |
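A hypothetical sketch of the kind of two-table layout the abstract describes, with a competence base of collocation descriptions linked to an example base of corpus citations. The table and column names are assumptions made for illustration, not the actual CDB schema.

```python
# Hypothetical two-table layout in the spirit of CDB: a competence base of
# collocation descriptions linked to an example base of corpus citations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collocation (          -- competence base
    id          INTEGER PRIMARY KEY,
    preposition TEXT,               -- e.g. 'in'
    noun        TEXT,               -- e.g. 'Frage'
    verb        TEXT,               -- e.g. 'stellen'
    reading     TEXT                -- description of the collocational reading
);
CREATE TABLE example (              -- example base
    id             INTEGER PRIMARY KEY,
    collocation_id INTEGER REFERENCES collocation(id),
    sentence       TEXT,            -- corpus sentence containing the collocation
    source         TEXT             -- corpus / document identifier
);
""")
conn.execute("INSERT INTO collocation VALUES (1, 'in', 'Frage', 'stellen', 'to call into question')")
conn.execute("INSERT INTO example VALUES (1, 1, '... stellt die These in Frage ...', 'sample corpus')")
for row in conn.execute("""SELECT c.preposition, c.noun, c.verb, e.sentence
                           FROM collocation c JOIN example e ON e.collocation_id = c.id"""):
    print(row)
```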
Chinese-English Semantic Resource Construction | We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We leverage three existing resources: a classification of English verbs called EVCA (English Verb Classes and Alternations) (Levin, 1993), a Chinese conceptual database called HowNet (Zhendong, 1988c; Zhendong, 1988b; Zhendong, 1988a) (http://www.how-net.com), and a large machine-readable dictionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval. | |
CLinkA A Coreferential Links Annotator | The annotation of coreferential chains in a text is a difficult task which requires a lot of concentration. Given its complexity, it is very difficult to produce high-quality coreferentially annotated corpora without an appropriate tool. In this paper we discuss the requirements for developing a tool for helping the human annotator in this task. The annotation scheme used by our program is derived from the one proposed in the MUC-7 Coreference Task Annotation, but it is not restricted to that scheme alone: using a very simple language, the user is able to define his/her own annotation scheme. The tool has a user-friendly interface and is language and platform independent. | |
COCOSDA - a Progress Report | This paper presents a review of the activities of COCOSDA, the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output. COCOSDA has a history of innovative actions which spawn national and regional consortia for the co-operative development of speech corpora and for the promotion of research in related topics. COCOSDA has recently undergone a change of organisation in order to meet the developing needs of the speech- and language-processing technologies and this paper summarises those changes. | |
Collocations as Word Co-occurrence Restriction Data - An Application to Japanese Word Processor - | Collocations, the combination of specific words, are quite useful linguistic resources for NLP in general. The purpose of this paper is to show their usefulness, exemplifying an application to Kanji character decision processes for Japanese word processors. Unlike recent trials of automatic extraction, our collocations were collected manually through many years of intensive investigation of corpora. Our collection procedure consists of (1) finding a proper combination of words in a corpus and (2) recollecting similar combinations of words, incited by it. This procedure, which depends on human judgment and the enrichment of data by association, is effective for remedying the data-sparseness problem, although the arbitrariness of human judgment is inevitable. Approximately seventy-two thousand four hundred collocations were used as word co-occurrence restriction data for deciding Kanji characters in Japanese word processing. Experiments have shown that the collocation data yield 8.9% higher Kana-to-Kanji character conversion accuracy than the same system using no collocation data, and 7.0% higher than a commercial word processor of average performance. | |
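A toy sketch of how collocation data can act as a co-occurrence restriction when choosing among homophonous Kanji candidates for a Kana input; the candidate and collocation lists below are invented and far smaller than the roughly 72,400 collocations used in the paper.

```python
# Toy sketch: prefer the Kanji candidate that forms a known collocation
# with a context word. Candidate and collocation lists are illustrative only.
KANJI_CANDIDATES = {
    "きる": ["切る", "着る"],   # kiru: 'to cut' vs. 'to wear'
}
COLLOCATIONS = {
    ("紙", "切る"),   # kami o kiru: 'cut paper'
    ("服", "着る"),   # fuku o kiru: 'wear clothes'
}

def choose_kanji(context_word, kana):
    """Return the Kanji candidate licensed by a known collocation, if any."""
    candidates = KANJI_CANDIDATES.get(kana, [kana])
    for cand in candidates:
        if (context_word, cand) in COLLOCATIONS:
            return cand
    return candidates[0]  # fall back to the first candidate

print(choose_kanji("紙", "きる"))  # -> 切る
print(choose_kanji("服", "きる"))  # -> 着る
```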
Constructing a Tagged E-J Parallel Corpus for Assisting Japanese Software Engineers in Writing English Abstracts | This paper presents how we constructed a tagged E-J parallel corpus of sample abstracts, which is the core language resource for our English abstract writing tool, the “Abstract Helper.” This writing tool is aimed at helping Japanese software engineers be more productive in writing by providing them with good models of English abstracts. We collected 539 English abstracts from technical journals/proceedings and prepared their Japanese translations. After analyzing the rhetorical structure of these sample abstracts, we tagged each sample abstract with both an abstract type and an organizational-scheme type. We also tagged each sample sentence with a sentence role and one or more verb complementation patterns. We also show that our tagged E-J parallel corpus of sample abstracts can be effectively used for providing users with both discourse-level guidance and sentence-level assistance. Finally, we discuss the outlook for further development of the “Abstract Helper.” | |
Controlled Bootstrapping of Lexico-semantic Classes as a Bridge between Paradigmatic and Syntagmatic Knowledge: Methodology and Evaluation | Semantic classification of words is a highly context-sensitive and somewhat moving target, hard to deal with and even harder to evaluate on an objective basis. In this paper we suggest a step-wise methodology for automatic acquisition of lexico-semantic classes and delve into the non-trivial issue of how results should be evaluated against a top-down reference standard. | |
Coping with Lexical Gaps when Building Aligned Multilingual Wordnets | In this paper we present a methodology for automatically classifying the translation equivalents of a machine-readable bilingual dictionary into three main groups: lexical units, lexical gaps (that is, cases in which a lexical concept of one language has no correspondent in the other), and translation equivalents that need to be manually classified as lexical units or lexical gaps. This preventive classification reduces the manual work necessary to cope with lexical gaps in the construction of aligned multilingual wordnets. | |
Coreference Annotation: Whither? | The terms coreference and anaphora tend to be used inconsistently and interchangeably in much empirically-oriented work in NLP, and this threatens to lead to incoherent analyses of texts and arbitrary loss of information. This paper discusses the role of coreference annotation in Information Extraction, focussing on the coreference scheme defined for the MUC-7 evaluation exercise. We point out deficiencies in that scheme and make some suggestions towards a new annotation philosophy. | |
Coreference in Annotating a Large Corpus | The Prague Dependency Treebank (PDT) is a part of the Czech National Corpus, annotated with disambiguated structural descriptions representing the meaning of every sentence in its environment. To achieve that aim, it is necessary i.a. to make explicit (at least some basic) coreferential relations within the sentence boundaries and also beyond them. The PDT scenario includes both automatic and 'manual' procedures; among the former type, there is one that concerns coreference, indicating the lemma of the subject in a specific attribute of the label belonging to a node for a reflexive pronoun, and assigning the deleted nodes in coordinated constructions the lemmas of their counterparts in the given construction. 'Manual' operations restore nodes for the deleted items mostly as pronouns. The distinction between grammatical and textual coreference is reflected. In order to get a possibility of handling textual coreference, specific attributes reflect the linking of sentences to each other and to the context of situation, and the development of the degrees of activation of the 'stock of shared knowledge' will be registered in so far as they are derivable from the use of nouns in subsequent utterances in a discourse. | |
Coreference Resolution Evaluation Based on Descriptive Specificity | This paper introduces a new evaluation method for the coreference resolution task. Considering that coreference resolution is a matter of linking expressions to discourse referents, we set our evaluation criterion in terms of an evaluation of the denotations assigned to the expressions. This criterion requires that the coreference chains identified in one annotation stand in a one-to-one correspondence with the coreference chains in the other. To determine this correspondence, and with a view to keeping closer to what a human interpretation of the coreference chains would be, we take into account the fact that, in a coreference chain, some expressions are more specific to their referent than others. With this observation in mind, we measure the similarity between the chains in one annotation and the chains in the other, and then compute the optimal similarity between the two annotations. Evaluation then consists in checking whether the denotations assigned to the expressions are correct or not. New measures to analyse errors are also introduced. A comparison with other methods is given at the end of the paper. | |
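A simplified sketch of the chain-correspondence step: score every pair of key and response chains and keep the one-to-one assignment with the highest total similarity. Plain mention overlap stands in here for the paper's specificity-weighted similarity, and the mention identifiers are invented.

```python
# Sketch of the one-to-one chain correspondence: exhaustive search over
# assignments, maximizing total chain similarity (fine for small examples).
from itertools import permutations

def chain_similarity(key_chain, resp_chain):
    # Plain mention overlap; the paper weights mentions by descriptive specificity.
    return len(set(key_chain) & set(resp_chain))

def best_correspondence(key_chains, resp_chains):
    best, best_score = None, -1
    for perm in permutations(range(len(resp_chains)), len(key_chains)):
        score = sum(chain_similarity(key_chains[i], resp_chains[j])
                    for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best, best_score

# Invented mention identifiers for illustration.
key = [["m1", "m2", "m3"], ["m4", "m5"]]
resp = [["m4", "m5"], ["m1", "m2"], ["m6"]]
print(best_correspondence(key, resp))  # key chain 0 -> resp chain 1, key chain 1 -> resp chain 0
```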
Corpora of Slovene Spoken Language for Multi-lingual Applications | The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi-modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools (EAGLES Handbook 1997). This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics. | |
Corpus Resources and Minority Language Engineering | Low density languages are typically viewed as those for which few language resources are available. Work relating to low density languages is becoming a focus of increasing attention within language engineering (e.g. Charoenporn, 1997; Hall and Hudson, 1997; Somers, 1997; Nirenberg and Raskin, 1998; Somers, 1998). However, much work related to low density languages is still in its infancy, or worse, work is blocked because the resources needed by language engineers are not available. In response to this situation, the MILLE (Minority Language Engineering) project was established by the Engineering and Physical Sciences Research Council (EPSRC) in the UK to discover what language corpora should be built to enable language engineering work on non-indigenous minority languages in the UK, most of which are typically low-density languages. This paper summarises some of the major findings of the MILLE project. | |
Creating and Using Domain-specific Ontologies for Terminological Applications | Huge volumes of scientific databases and text collections are constantly becoming available, but their usefulness is at present hampered by their lack of uniformity and structure. There is therefore an overwhelming need for tools to facilitate the processing and discovery of technical terminology, in order to make processing of these resources more efficient. Both NLP and statistical techniques can provide such tools, but they would benefit greatly from the availability of suitable lexical resources. While information resources do exist in some areas of terminology, these are not designed for linguistic use. In this paper, we investigate how one such resource, the UMLS, is used for terminological acquisition in the TRUCKS system, and how other domain-specific resources might be adapted or created for terminological applications. | |
Creation of Spoken Hebrew Databases | Two Spoken Hebrew databases were collected over fixed telephone lines at NSC - Natural Speech Communication. Their creation was based on the SpeechDat model, and they represent the first comprehensive spoken database in Modern Hebrew that can be successfully applied to the teleservices industry. The speakers are a representative sample of Israelis, based on sociolinguistic factors such as age, gender, years of education and country of origin. The database includes digit sequences, natural numbers, money amounts, time expressions, dates, spelled words, application words and phrases for teleservices (e.g., call, save, play), phonetically rich words, phonetically rich sentences, and names. Both read speech and spontaneous speech were elicited. | |
Cross-lingual Interpolation of Speech Recognition Models | A method is proposed for implementing the cross-lingual porting of recognition models for rapid prototyping of speech recognisers in new target languages, specifically when the collection of large speech corpora for training would be economically questionable. The paper describes a way to build up a multilingual model which includes the phonetic structure of all the constituent languages, and which can be exploited to interpolate the recognition units of a different language. The CTSU (Classes of Transitory-Stationary Units) approach is exploited to derive a well balanced set of recognition models, as a reasonable trade-off between precision and trainability. The phonemes of the untrained language are then mapped onto the multilingual inventory of recognition units, and the corresponding CTSUs are then obtained. The procedure was tested with a preliminary set of 10 Rumanian speakers starting from an Italian-English-Spanish CTSU model. The optimal mapping of the vowel phone set of this language onto the multilingual phone set was obtained by inspecting the F1 and F2 formants of the vowel sounds from two male and female Rumanian speakers, and by comparing them with the values of F1 and F2 of the other three languages. Results in terms of recognition word accuracy measured on a preliminary test set of 10 speakers are reported. |
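A sketch of the vowel-mapping step under stated assumptions: each vowel of the untrained language is assigned to the vowel of the multilingual inventory with the nearest (F1, F2) centroid. The formant values below are rough placeholder figures for illustration, not the measurements reported in the paper.

```python
# Sketch: map a vowel of an untrained language onto a multilingual vowel
# inventory by nearest (F1, F2) distance. Formant values are placeholders.
import math

MULTILINGUAL_VOWELS = {      # (F1, F2) centroids in Hz, placeholder values
    "a": (800, 1300),
    "e": (500, 1900),
    "i": (300, 2300),
    "o": (500, 900),
    "u": (320, 800),
}

def map_vowel(f1, f2):
    """Return the inventory vowel whose (F1, F2) centroid is closest."""
    return min(MULTILINGUAL_VOWELS,
               key=lambda v: math.dist((f1, f2), MULTILINGUAL_VOWELS[v]))

# A vowel measured in the new language (placeholder formant values).
print(map_vowel(550, 1400))
```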