LREC 2000 - Abstracts

LREC 2000 2nd International Conference on Language Resources & Evaluation

Conference Papers and Abstracts

Paper	Paper Title	Abstract
295	Rarity of Words in a Language and in a Corpus	A simple method presented last year (Hlavacova & Rychly, 1999) automatically distinguishes between rare and common words that have the same frequency in a language corpus. The method operates with two new terms: reduced frequency and rarity. Rarity was proposed as a measure of word rareness or commonness in a language. This article examines rarity in more depth. Its value was calculated for several different corpora and compared. Two experiments were carried out on real data taken from the Czech National Corpus. The results of the first show that reordering the texts in the corpus does not influence the rarity of words with a high frequency in the corpus. The second experiment compares the rarity of the same words in two corpora of different sizes.
377	Recent Developments within the European Language Resources Association (ELRA)	ELRA's main and most visible achievement is the growth of its catalogue. The ELRA catalogue as of April 2000 lists 111 speech resources, 50 monolingual lexica, 113 multilingual lexica, 24 written corpora and 275 terminological databases. However, many Language Resources (LRs) still need to be identified and/or produced. To this end, ELRA is active in promoting and funding the co-production of new LRs through several calls for proposals. As for the value of ELRA as a distributor of language resources, the statistics from the past two years speak for themselves. The 1999 fiscal report showed a rise, with 217 LRs sold (122 for research and 95 for commercial purposes, with speech databases representing nearly 45%), compared with 180 LRs sold in 1998 (90 for research and 90 for commercial purposes, with speech databases representing nearly 65%) and 33 sold in 1997. The other visible action of ELRA is its membership drive: since its foundation, ELRA has attracted an increasing number of members (from 63 in 1995 to 95 in 1999). This article is updated from a paper presented at Eurospeech'99.
167	Recruitment Techniques for Minority Language Speech Databases: Some Observations	This paper describes the collection efforts for SpeechDat Cymru, a 2000-speaker database for Welsh, a minority language spoken by about 500,000 of the Welsh population. The database is part of the SpeechDat(II) project. General database details are discussed insofar as they affect recruitment strategies, and likely differences between minority language spoken language resource (SLR) and general SLR collection are noted. Individual recruitment techniques are then detailed, with an indication of their relative successes and relevance to minority language SLR collection generally. It is observed that no one technique was sufficient to collect the entire database, and that those techniques involving face-to-face recruitment by an individual closely involved with the database collection produced the best yields for effort expended. More traditional postal recruitment techniques were less successful. The experiences during collection underlined the importance of utilising enthusiastic recruiters, and taking advantage of the speaker networks present in the community.
307	Regional Pronunciation Variants for Automatic Segmentation	The goal of this paper is to create an extended rule corpus of approximately 2300 phonetic rules which model the segmental variation of regional variants of German. The phonetic rules express, at a broad-phonetic level, phenomena of phonetic reduction in German that occur within words and across word boundaries. To improve the automatic segmentation of regional speech variants, these rules are clustered and implemented, depending on regional specification, in the Munich Automatic Segmentation System.
182	Resources for Lexicalized Tree Adjoining Grammars and XML Encoding: TagML	This work addresses both practical and theoretical purposes for the encoding and exploitation of linguistic resources for feature-based Lexicalized Tree Adjoining Grammars (LTAG). The main goals of these specifications are the following: 1. Define a recommendation, by way of an XML (Bray et al., 1998) DTD or schema (Fallside, 2000), for encoding LTAG resources in order to exchange grammars, share tools and compare parsers. 2. Exploit XML, its features and the related recommendations for the representation of complex and redundant linguistic structures based on a general methodology. 3. Study the resource organisation and the level of generalisation which are relevant for a lexicalized tree grammar.
241	Resources for Multilingual Text Generation in Three Slavic Languages	The paper discusses the methods followed to re-use a large-scale, broad-coverage English grammar for constructing similar-scale grammars for Bulgarian, Czech and Russian for the fast prototyping of a multilingual generation system. We present (1) the theoretical and methodological basis for resource sharing across languages, and (2) the use of a corpus-based contrastive register analysis, in particular, contrastive analysis of mood and agency. Because the study concerns reuse of the grammar of a language that is typologically quite different from the languages treated, the issues addressed in this paper appear relevant to a wider range of researchers in need of large-scale grammars for less-researched languages.
298	Reusability as Easy Adaptability: A Substantial Advance in NL Technology	The design and implementation of new applications in NLP at low cost depend mostly upon the availability of technologies oriented to the solution of any specific problem. The success of this task, besides the use of widely agreed formats and standards, relies upon at least two families of tools: those for managing and updating, and those for projecting an "application view-point" onto the data in the repository. This approach has different realizations if applied to a dictionary, a corpus, or a grammar. Some examples, taken from European and other industrial projects, show that reusability: a) in the building of industrial prototypes, consists in the easy reconfiguration of resources (dictionary and grammar), easy portability and easy recombination of tools, by means of simple APIs, as well as on different implementation platforms; b) in the building of advanced applications, still consists in the same features, together with the possibility of opening different view-points on dictionaries and grammars.
74	Reusing the Mikrokosmos Ontology for Concept-based Multilingual Terminology Databases	This paper reports work carried out within a multilingual terminology project (OncoTerm) in which the Mikrokosmos (µK) ontology (Mahesh, 1996; Viegas et al., 1999) has been used as a language-independent conceptual structure to achieve a truly concept-based terminology database (termbase, for short). The original ontology, containing nearly 4,700 concepts and available in Lisp-like format (January 1997 version), was first converted into a set of tables in a relational database. A specific software tool was developed in order to edit and browse this resource. This tool has now been integrated within a termbase editor and released under the name of OntoTerm™. In this paper we focus on the suitability of the µK ontology for the representation of domain-specific knowledge and its associated lexical items.
199	Rule-based Tagging: Morphological Tagset versus Tagset of Analytical Functions	This work presents part of a more global study on the problem of parsing Czech and on the knowledge extraction capabilities of the Rule-based method. It is shown that the success of the Rule-based method for English and its lack of success for Czech are not only due to the small cardinality of the English tagset (as is usually claimed) but mainly depend on its structure ("regularity" of the language information).
370	Russian Monitor Corpora: Composition, Linguistic Encoding and Internet Publication	The LinGO (Linguistic Grammars Online) project’s English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see http://lingo.stanford.edu). Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels.