LREC 2000 2nd International Conference on Language Resources & Evaluation | |
Conference Papers and Abstracts
Papers and abstracts by paper title: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Papers and abstracts by ID number: 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-377. |
Improving Lexical Databases with Collocational Information: Data from Portuguese | This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can be conceived through relations linking the lexical components. Mechanisms for dealing with the collocation-hood of the expressions are required to be included in the systems, topographically, in their lexical modules. Lexical databases like wordnets, with a general architecture typically structured on semantic relations, make room for the specification of this phenomenon. This can be handled through the definition of ad-hoc relations expressing the different semantic effects the adjectival modification bring to nominal phrases, collocationally. | |
Integrating Seed Names and ngrams for a Named Entity List and Classifier | We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled either as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in a 83-million word corpus, and their immediate contexts are stored as instances of their label. The latter 8-grams are used by a memory-based machine learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61 % and a recall of 56 %. On free text, named-entity token labeling accuracy is 71 %. | |
Integrating Subject Field Codes into WordNet | In this paper, we present a lexical resource where WordNet synsets are annotated with Subject Field Codes. We discuss both the methodological issues we dealt with and the annotation techniques used. A quantitative analysis of the resource coverage, as well as a qualitative evaluation of the proposed annotations, are reported. | |
Interactive Corpus Annotation | We present an easy-to-use graphical tool for syntactic corpus annotation. This tool, Annotate, interacts with a part-of-speech tagger and a parser running in the background. The parser incrementally suggests single phrases bottom-up based on cascaded Markov models. A human annotator confirms or rejects the parser’s suggestions. This semi-automatic process facilitates a very rapid and efficient annotation. | |
Inter-annotator Agreement for a German Newspaper Corpus | This paper presents the results of an investigation on inter-annotator agreement for the NEGRA corpus, consisting of German newspaper texts. The corpus is syntactically annotated with part-of-speech and structural information. Agreement for part-of-speech is 98.6%, the labeled F-score for structures is 92.4%. The two annotations are used to create a common final version by discussing differences and by several iterations of cleaning. Initial and final versions are compared. We identify categories causing large numbers of differences and categories that are handled inconsistently. | |
Interarbora and Thistle - Delivering Linguistic Structure by the Internet | I describe an Internet service ''Interarbora'', which facilitates the visualization of tree structures. The service is built on top of a general purpose editor ''Thistle'', which allows the editing of diagrams and the generation of print format representations. | |
Introduction of KIBS (Korean Information Base System) Project | This project has been carried out on the basis of resources and tools for Korean NLP. The main research is the construction of raw corpus of 64 million tokens and Part-of-Speech tagged corpus of about 11 million tokens. And we develop some analytic tools to construct and some supporting tools to navigate them. This paper represents the present state of the work carried out by the KIBS project. We introduce a KAIST tag set of POS and syntax for standard corpus and annotation principles. And we explain several error types represented in tagged corpus. | |
IPA Japanese Dictation Free Software Project | Large vocabulary continuous speech recognition (LVCSR) is an important basis for the application development of speech recognition technology. We had constructed Japanese common LVCSR speech database and have been developing sharable Japanese LVCSR programs/models by the volunteer-based efforts. We have been engaged in the following two volunteer-based activities. a) IPSJ (Information Processing Society of Japan) LVCSR speech database working group. b) IPA (Information Technology Promotion Agency) Japanese dictation free software project. IPA Japanese dictation free software project (April 1997 to March 2000) is aiming at building Japanese LVCSR free software/models based on the IPSJ LVCSR speech database (JNAS) and Mainichi newspaper article text corpus. The software repository as the product of the IPA project is available to the public. More than 500 CD-ROMs have been distributed. The performance evaluation was carried out for the simple version, the fast version, and the accurate version in February 2000. The evaluation uses 200 sentence utterances from 46 speakers. The gender-independent HMM models and 20k/60k language models are used for evaluation. The accurate version with the 2000 HMM states and 16 Gaussian mixtures shows 95.9 % word correct rate. The fast version with the phonetic tied mixture HMM and the 1/10 reduced language model shows 92.2 % word correct rate and realtime speed. The CD-ROM with the IPA Japanese dictation free software and its developing workbench will be distributed by the registration to http://www.lang.astem.or.jp/dictation-tk/ or by sending e-mail to dictation-tk-request@astem.or.jp. | |
IREX: IR & IE Evaluation Project in Japanese | We will report on the IREX (Information Retrieval and Extraction Exercise) project. It is an evaluation-based project for Information Retrieval and Information Extraction in Japanese. The project started in May 1998 and concluded in September 1999 with the IREX workshop held in Tokyo with more than 150 attendance (IREX Commettee, 1999). There is a homepage of the project at (IREX, Homepage) and anyone can download almost all the data and the tools produced by the project for free. | |
Issues from Corpus Analysis that have influenced the On-going Development of Various Haitian Creole Text- and Speech-based NLP Systems and Applications | This paper describes issues that are relevant to using small- to large-sized corpora for the training and testing of various text- and speech-based natural language processing (NLP) systems for minority and vernacular languages. These R&D and commercial systems and applications include machine translation, orthography conversion, optical character recognition, speech recognition, and speech synthesis that have already been produced for the Haitian Creole (HC) language. Few corpora for minority and vernacular languages have been created specifically for language resource distribution and for NLP system training. As a result, some of the only available corpora are those that are produced within real end-user environments. It is therefore of utmost importance that written language standards be created and then observed so that research on various text- and speech-based systems can be fruitful. In doing so, this also provides vernacular and minority languages with the opportunity to have an impact within the globalization and advanced communication needs efforts of the modern day world. Such technologies can significantly influence the status of these languages, yet the lack of standardization is a severe impediment to technological development. A number of relevant issues are discussed in this paper. | |
Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium | The Linguistic Data Consortium (LDC) is a non-profit consortium of universities, companies and government research laboratories that supports education, research and technology development in language related disciplines by collecting or creating, distributing and archiving language resources including data and accompanying tools, standards and formats. LDC was founded in 1992 with a grant from the Defense Advanced Research Projects Agency (DARPA) to the University of Pennsylvania as host organization. LDC publication and distribution activities self-support from membership fees and data sales while new data creation is supported primarily by grants from DARPA and the National Science Foundation. Recent developments in the creation and use of language resources demand new roles for international data centers. Since our report at the last Language Resource and Evaluation Conference in Granada in 1998, LDC has observed growth in the demand for language resources along multiple dimensions: larger corpora with more sophisticated annotation in a wider variety of languages are used in an increasing number of language related disciplines. There is also increased demand for reuse of existing corpora. Most significantly, small research groups are taking advantage of advances in microprocessor technology, data storage and internetworking to create their own corpora. This has lead to the birth of new annotation practices whose very variety creates barriers to data sharing. This paper will describe recent LDC efforts to address emerging issues in the creation and distribution of language resources. | |
Issues in Design and Collection of Large Telephone Speech Corpus for Slovenian Language | In this paper, different issues in design, collection and evaluation of the large vocabulary telephone speech corpus of Slovenian language are discussed. The database is composed of three text corpora containing 1530 different sentences. It contains read speech of 82 speakers where each speaker read in average more than 200 sentences and 21 speakers read also the text passage of 90 sentences. The initial manual segmentation and labeling of speech material was performed. Based on this the automatic segmentation was carried out. The database should facilitate the development of speech recognition systems to be used in dictation tasks over the telephone. Until now the database was used mostly for isolated digit recognition tasks and word spotting. | |
Issues in the Evaluation of Spoken Dialogue Systems - Experience from the ACCeSS Project | We describe the framework and present detailed results of an evaluation of 1.500 dialogues recorded during a three-months field-trial of the ACCeSS Dialogue System. The system was routing incoming calls to agents of a call-center and handled about 100 calls per day. | |
ItalWordNet: a Large Semantic Database for Italian | The focus of this paper is on the work we are carrying out to develop a large semantic database within an Italian national project, SI-TAL, aiming at realizing a set of integrated (compatible) resources and tools for the automatic processing of the Italian language. Within SI-TAL, ItalWordNet is the reference lexical resource which will contain information related to about 130,000 word senses grouped into synsets. This lexical database is not being created ex novo, but extending and revising the Italian lexical wordnet built in the framework of the EuroWordNet project. In this paper we firstly describe how the lexical coverage of our wordnet is being extended by adding adjectives, adverbs and proper nouns, plus a terminological subset belonging to the economic and financial domain. The relevant changes involved by these extensions both in the linguistic model and in the data structure are then illustrated. In particular we discuss i) the new semantic relations identified to encode information on adjectives and adverbs ii) the new architecture including the terminological subset. |