LREC 2000: 2nd International Conference on Language Resources & Evaluation
Conference Papers and Abstracts
Labeling of Prosodic Events in Slovenian Speech Database GOPOLIS | The paper describes prosodic annotation procedures for the GOPOLIS Slovenian speech database and methods for automatic classification of different prosodic events. Several statistical parameters concerning the duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach. The obtained knowledge of Slovenian prosody can be used in Slovenian speech recognition and understanding for automatic prosodic event determination, and in Slovenian speech synthesis for prosody prediction. | |
Language Resources as by-Product of Evaluation: The MULTITAG Example | In this paper, we show how the paradigm of evaluation can function as a producer of high-quality, low-cost validated language resources. First the paradigm of evaluation is presented and the main points of its history are recalled, from the first deployment that took place in the USA during the DARPA/NIST evaluation campaigns, up to the latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS etc.). Then the principle behind the method used to produce high-quality validated language resources at low cost from the by-products of an evaluation campaign is exposed. It was inspired by the ROVER experiments (Recognizer Output Voting Error Reduction) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating systems with a simple voting strategy to obtain higher performance results. Here we make a link with the existing strategies for system combination studied in machine learning. As an illustration we describe how the MULTITAG project funded by CNRS has built, from the by-products of the GRACE evaluation campaign (a French Part-Of-Speech tagging system evaluation campaign), a corpus of around 1 million words annotated with a fine-grained tagset derived from the EAGLES and MULTEXT projects. A brief presentation of the state of the art in Part-Of-Speech (POS) tagging and of the problem posed by its evaluation is given at the beginning; then the corpus itself is presented along with the procedure used to produce and validate it. In particular, the cost reduction brought by using this method instead of more classical methods is presented, and its generalization to other control tasks is discussed in the conclusion. | |
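The voting combination described in this abstract is straightforward to picture. Below is a minimal sketch, not the MULTITAG/GRACE implementation, showing a token-level majority vote over the outputs of several POS taggers; the tagger outputs and tagset are hypothetical examples.

# Illustrative sketch: combining the token-level output of several POS taggers
# with a simple majority vote, in the spirit of the ROVER-style combination
# described above. Taggers, tokens and tags are hypothetical.
from collections import Counter

def combine_taggers(tag_sequences):
    """Given one tag sequence per tagger (all aligned on the same tokens),
    return the majority tag for each token position."""
    combined = []
    for position_tags in zip(*tag_sequences):
        # most_common(1) picks the most frequent tag; ties are broken
        # arbitrarily here, whereas a real campaign might weight taggers
        # by their individual accuracy.
        tag, _count = Counter(position_tags).most_common(1)[0]
        combined.append(tag)
    return combined

if __name__ == "__main__":
    # Three hypothetical taggers disagreeing on the second token.
    tagger_a = ["DET", "NOUN", "VERB"]
    tagger_b = ["DET", "ADJ", "VERB"]
    tagger_c = ["DET", "NOUN", "VERB"]
    print(combine_taggers([tagger_a, tagger_b, tagger_c]))
    # -> ['DET', 'NOUN', 'VERB']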
Language Resources Development at the Spanish Royal Academy | This paper explains some of the most relevant issues concerning the development of language resources at the Spanish Royal Academy. Two 125-million-word corpora of the Spanish language (one synchronic and one diachronic) and three specialized corpora have been developed. Around these corpora, RAE is also developing NLP tools and resources to annotate them morpho-syntactically. Some of the most relevant are: the Computational Lexicon, the Morphological analysis tools, the Disambiguation grammars and the Tokenizer generator. The last section describes the lexicographic use of corpus materials and includes a brief description of the corpus-based lexicographical workbench and its related tools. | |
Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts | This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segmentation of the news streams into individual stories, detection of new topics, identification of the first story to discuss any topic, tracking of all stories on selected topics and detection of links among stories discussing the same topics. The corpora contain English and Chinese broadcast television and radio, newswires, and text from web sites devoted to news. For each source there are texts or text intermediaries; for the broadcast stories the audio is also available. Each broadcast is also segmented to show the start and end times of all news stories. LDC staff have defined news topics in the corpora and annotated each story to indicate its relevance to each topic. The end products are massive, richly annotated corpora available to support research and development in information retrieval, topic detection and tracking, information extraction and message understanding, directly or after additional annotation. This paper will describe the corpora created for TDT including sources, collection processes, formats, topic selection and definition, annotation, distribution and project management for large corpora. | |
Layout Annotation in a Corpus of Patient Information Leaflets | We discuss the problems and issues that arose during the development of a procedure for annotating layout in a corpus of Patient Information Leaflets. We show how the genre of the corpus as well as the aim of the annotation influenced the annotation scheme. We also describe the automatic annotation procedure. | |
Le Programme Compalex (COMPAraison LEXicale) [The Compalex Programme (Lexical Comparison)] | ||
Learning Preference of Dependency between Japanese Subordinate Clauses and its Evaluation in Parsing | (Utsuro et al., 2000) proposed a statistical method for learning the dependency preference of Japanese subordinate clauses, in which the scope-embedding preference of subordinate clauses is exploited as a useful information source for disambiguating dependencies between subordinate clauses. Following (Utsuro et al., 2000), this paper presents detailed results of evaluating the proposed method by comparing it with several closely related existing techniques, and shows that the proposed method outperforms those existing techniques. | |
Learning Verb Subcategorization from Corpora: Counting Frame Subsets | We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% accuracy on unseen parsed text. | |
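To make the counting idea behind frame discovery concrete, here is a minimal sketch under simplifying assumptions: observed dependent sets are tallied per verb and kept only if their relative frequency clears a threshold. The threshold, the example Czech verb and the dependent labels are hypothetical, and the paper itself compares more sophisticated statistical filters than this one.

# Minimal sketch of frequency-based frame filtering (not the paper's method):
# tally the sets of dependents observed with each verb and keep only those
# frames seen often enough relative to all observations of that verb.
from collections import Counter, defaultdict

def discover_frames(observations, threshold=0.05):
    """observations: iterable of (verb, frozenset_of_dependent_labels) pairs."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    frames = {}
    for verb, frame_counts in counts.items():
        total = sum(frame_counts.values())
        # Relative-frequency filter: discard rare, probably noisy frames.
        frames[verb] = {f for f, c in frame_counts.items() if c / total >= threshold}
    return frames

if __name__ == "__main__":
    # Hypothetical observations of the Czech verb 'dát' ('give').
    data = [
        ("dát", frozenset({"ACC", "DAT"})),
        ("dát", frozenset({"ACC", "DAT"})),
        ("dát", frozenset({"ACC"})),
    ]
    print(discover_frames(data, threshold=0.3))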
Lessons Learned from a Task-based Evaluation of Speech-to-Speech Machine Translation | For several years we have been conducting Accuracy Based Evaluations (ABE) of the JANUS speech-to-speech MT system (Gates et al., 1997), which measure quality and fidelity of translation. Recently we have begun to design a Task Based Evaluation (TBE) for JANUS (Thomas, 1999), which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation. Both evaluations (ABE and TBE) were conducted on a common set of user studies in the semantic domain of travel planning. | |
Lexical and Translation Equivalence in Parallel Corpora | In the present paper we intend to investigate to what extent the use of parallel corpora can help to eliminate some of the difficulties noted with bilingual dictionaries. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give and the preposition with. George Orwell's novel 1984, which is available in English-Hungarian sentence-aligned form, was used as source material. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology. | |
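The concordance-based analysis this abstract relies on can be pictured with a small sketch. Assuming a hypothetical file format of one tab-separated English-Hungarian sentence pair per line (not the authors' actual tooling or file names), aligned concordances for a query word could be pulled out as follows.

# Illustrative sketch only: extracting a bilingual concordance for one English
# word from a sentence-aligned corpus. The file name and the one-pair-per-line
# tab-separated format are simplifying assumptions.
import re

def concordance(aligned_file, word):
    """Yield (english_sentence, hungarian_sentence) pairs containing `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    with open(aligned_file, encoding="utf-8") as handle:
        for line in handle:
            english, _sep, hungarian = line.rstrip("\n").partition("\t")
            if pattern.search(english):
                yield english, hungarian

if __name__ == "__main__":
    # Hypothetical aligned file; prints each English hit with its Hungarian pair.
    for en, hu in concordance("orwell_1984_en_hu.tsv", "head"):
        print(en)
        print("    " + hu)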
Lexicalised Systematic Polysemy in WordNet | This paper describes an attempt to gain more insight into the mechanisms that underlie lexicalised systematic polysemy. This phenomenon is interpreted as systematic sense combinations that are valid for more than one word. WordNet is exploited to create a working definition of systematic polysemy and to extract polysemic patterns, providing a generalisation that allows the identification of fine-grained semantic relations between the senses of the words participating in a systematic polysemic pattern. | |
LEXIPLOIGISSI: An Educational Platform for the Teaching of Terminology in Greece | This paper introduces a project, LEXIPLOIGISSI, which involves the use of language resources for educational purposes. More particularly, the aim of the project is to develop written corpora, electronic dictionaries and exercises to enhance students’ reading and writing abilities in six different school subjects. It is the product of a small-scale pilot program that will be part of the school curriculum in the three grades of Upper Secondary Education in Greece. The application seeks to create exploratory learning environments in which digital sound, image, text and video are fully integrated through the educational platform and placed under the direct control of users, who are able to follow individual pathways through the data stores. | |
Live Lexicons and Dynamic Corpora Adapted to the Network Resources for Chinese Spoken Language Processing Applications in an Internet Era | In the future network era, a huge volume of information on all subject domains will be readily available via the network. This network information is dynamic, ever-changing and exploding. Furthermore, many spoken language processing applications will have to deal with the content of the network information, which is dynamic; this means dynamic lexicons, language models and so on will be required. In order to cope with such a new network environment, automatic approaches for the collection, classification, indexing, organization and utilization of the linguistic data obtainable from the networks for language processing applications will be very important. On the one hand, high-performance spoken language technology can hopefully be developed based on such dynamic linguistic data on the network. On the other hand, it is also necessary that such spoken language technology can be intelligently adapted to the content of the dynamic and ever-changing network information. Some basic concepts for live lexicons and dynamic corpora adapted to the network resources have been developed for Chinese spoken language processing applications and are briefly summarized in this paper. Although the major considerations here are for the Chinese language, the concepts may equally apply to other languages as well. | |
Looking for Errors: A Declarative Formalism for Resource-adaptive Language Checking | The paper describes a phenomenon-based approach to grammar checking, which draws on the integration of different shallow NLP technologies, including morphological and POS taggers, as well as probabilistic and rule-based partial parsers. We present a declarative specification formalism for grammar checking and controlled language applications which greatly facilitates the development of checking components. | |
LT TTT - A Flexible Tokenisation Tool | We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation |