LREC 2000: 2nd International Conference on Language Resources & Evaluation
Paper ID | Paper Title | Abstract |
---|---|---|
52 | Creation of Spoken Hebrew Databases | Two Spoken Hebrew databases were collected over fixed telephone lines at NSC - Natural Speech Communication. Their creation was based on the SpeechDat model, and they represent the first comprehensive spoken database in Modern Hebrew that can be successfully applied to the teleservices industry. The speakers are a representative sample of Israelis, based on sociolinguistic factors such as age, gender, years of education and country of origin. The databases include digit sequences, natural numbers, money amounts, time expressions, dates, spelled words, application words and phrases for teleservices (e.g., call, save, play), phonetically rich words, phonetically rich sentences, and names. Both read speech and spontaneous speech were elicited. |
53 | PLEDIT - A New Efficient Tool for Management of Multilingual Pronunciation Lexica and Batchlists | The program tool PLEDIT - Pronunciation Lexica Editor - was created for the efficient handling of pronunciation lexica and batchlists. PLEDIT is designed as a GUI that incorporates tools for fast and efficient management of pronunciation lexica and batchlists. The tool is written in Tcl/Tk/Tix and can thus be easily ported to different platforms. PLEDIT supports three lexicon format types: the Siemens, SpeechDat and CMU lexicon formats. PLEDIT provides full editing capability for lexica and batchlists and supports work with multilingual resources. Some functions have been implemented as external programs written in the C programming language; these give PLEDIT higher speed and efficiency. |
55 | Use of Greek and Latin Forms for Term Detection | It is well known that many languages make use of neo-classical compounds, and that some domains with a very long tradition, such as medicine, make intensive use of such morphemes. This phenomenon has been widely studied for different languages, with the common result that a relatively small number of morphemes allows a high number of specialised terms to be detected. We believe that the use of such morphological knowledge may help a term detector discover very specialised terms. In this paper we propose a module, to be included in a term extractor, devoted specifically to detecting terms that include neo-classical compounds. We describe this module as well as the results obtained with it. |
56 | Methods and Metrics for the Evaluation of Dictation Systems: a Case Study | This paper describes the practical evaluation of two commercial dictation systems in order to assess the potential usefulness of such technology in the specific context of a translation service translating legal text into Italian. The service suffers at times from heavy workload, lengthy documents and short deadlines. Use of dictation systems accepting continuous speech might improve productivity at these times. Design and execution of the evaluation followed the methodology worked out by the EAGLES Evaluation Working Group. The evaluation therefore also constitutes a test bed application of this methodology. |
58 | Cairo: An Alignment Visualization Tool | While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system. We developed Cairo to fill this need. Cairo is a free, open-source, portable, user-friendly, GUI-driven program written in Java that provides a visual representation of word correspondences between bilingual pairs of sentences, as well as relevant translation model parameters. This program can be easily adapted for visualization of correspondences in bi-texts based on probability distributions. |
59 | An XML-based Representation Format for Syntactically Annotated Corpora | This paper discusses a general approach to the description and encoding of linguistic corpora annotated with hierarchically structured syntactic information. A general format can be motivated by the variety and incompatibility of existing annotation formats. By using XML as a representation format, the theoretical and technical problems encountered can be overcome. |
60 | An Experiment of Lexical-Semantic Tagging of an Italian Corpus | The availability of semantically tagged corpora is becoming a very important and urgent need for training and evaluation within a large number of applications; such corpora are also the natural application and accompaniment of semantic lexicons, for which they constitute both a useful testbed for evaluating their adequacy and a repository of corpus examples for the attested senses. It is therefore essential that sound criteria are defined for their construction and that a specific methodology is set up for the treatment of the various semantic phenomena relevant to this level of description. In this paper we present some observations and results concerning an experiment of manual lexical-semantic tagging of a small Italian corpus performed within the framework of the ELSNET project. The ELSNET experimental project has to be considered a feasibility study. It is part of a preparatory and training phase, started with the Romanseval/Senseval experiment (Calzolari et al., 1998), and ending with the lexical-semantic annotation of larger quantities of semantically annotated text, such as the syntactic-semantic Treebank which is to be annotated within an Italian National Project (SI-TAL). Indeed, the results of the ELSNET experiment have been of utmost importance for the definition of the technical guidelines for the lexical-semantic level of description of the Treebank. |
61 | SIMPLE: A General Framework for the Development of Multilingual Lexicons | The LE-SIMPLE project is an innovative attempt at building harmonized syntactic-semantic lexicons for 12 European languages, aimed at use in different Human Language Technology applications. SIMPLE provides a general design model for the encoding of a large amount of semantic information, spanning from ontological typing to argument structure and terminology. SIMPLE thus provides a general framework for resource development, where state-of-the-art results in lexical semantics are coupled with the needs of Language Engineering applications accessing semantic information. |
62 | Electronic Language Resources for Polish: POLEX, CEGLEX and GRAMLEX | We present theoretical results and resources obtained within three projects: the national project POLEX, Copernicus 1 Project CEGLEX (1032) and Copernicus Project GRAMLEX (632). The morphological resources obtained within these projects help to fill the gap on the map of available electronic language resources for Polish. After a short presentation of some common methodological bases defined within the POLEX project, we present the methodology and data obtained in the CEGLEX and GRAMLEX projects. The intention of the Polish language part of CEGLEX was to test the formats proposed by the GENELEX project against Polish data. The aim of the GRAMLEX project was to create corpus-based morphological resources for Polish. GRAMLEX refers directly to the morphological part of the CEGLEX project. Large samples of the data presented here are accessible at http://main.amu.edu.pl/~zlisi/projects.htm. |
63 | SPEECON - Speech Data for Consumer Devices | SPEECON, launched in February 2000, is a project focusing on collecting linguistic data for speech recogniser training. Run by an industrial consortium, it promotes the development of voice-controlled consumer applications such as television sets, video recorders, audio equipment, toys, information kiosks, mobile phones, palmtop computers and car navigation kits. During the lifetime of the project, scheduled to last two years, partners will collect speech data for 18 languages or dialectal zones, including most of the languages spoken in the EU. Attention will also be devoted to research into the recording environments, which match the typical surroundings of CE applications: at home, in the office, in public places or in moving vehicles. The following pages give a brief overview of the workplan for the months to come. |
66 | A Treebank of Spanish and its Application to Parsing | This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction. |
67 | End-to-End Evaluation of Machine Interpretation Systems: A Graphical Evaluation Tool | VERBMOBIL, a long-term project of the Federal Ministry of Education, Science, Research and Technology, aims at developing a mobile translation system for spontaneous speech. The source-language input consists of human speech (English, German or Japanese); the translation (bidirectional English-German and Japanese-German) and target-language output are effected by the VERBMOBIL system. Owing to the innovative character of the project, new methods for end-to-end evaluation had to be developed by a subproject established especially for this purpose. In this paper we present criteria for the evaluation of speech-to-speech translation systems and a tool for judging translation quality, the Graphical Evaluation Tool (GET). |
68 | A Proposal for the Integration of NLP Tools using SGML-Tagged Documents | In this paper we present the strategy used for the integration, in a common framework, of the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FS) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated so far are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general-purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. Due to the complexity of the information to be exchanged among the different tools, FSs are used to represent it. Feature structures are coded following the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The use of SGML for encoding the I/O streams flowing between programs forces us to formally describe the mark-up, and provides software to check that this mark-up holds invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for the communication between the tools has been designed and implemented. It offers the necessary operations to get the information from an SGML document containing FSs and to produce the corresponding output according to a well-defined FSD. |
69 | A Bilingual Electronic Dictionary for Frame Semantics | Frame semantics is a linguistic theory which is currently gaining ground. The creation of lexical entries for a large number of words presupposes the development of complex lexical acquisition techniques in order to identify the vocabulary for describing the elements of a 'frame'. In this paper, we show how a lexical-semantic database compiled on the basis of a bilingual (English-French) dictionary can be used to identify some general frame elements which are relevant in a frame-semantic approach such as the one adopted in the FrameNet project (Fillmore & Atkins 1998, Gahl 1998). The database has been systematically enriched with explicit lexical-semantic relations holding between some elements of the microstructure of the dictionary entries. The manifold relationships have been labelled in terms of lexical functions, based on Mel'cuk's notion of co-occurrence and lexical-semantic relations in Meaning-Text Theory (Mel'cuk et al. 1984). We show how these lexical functions can be used and refined to extract potential realizations of frame elements such as typical instruments or typical locatives, which are believed to be recurrent elements in a large number of frames. We also show how the database organization of the computational lexicon makes it possible to readily access implicit and translationally-relevant combinatorial information. |
70 | The Evaluation of Systems for Cross-language Information Retrieval | We describe the creation of an infrastructure for the testing of cross-language text retrieval systems within the context of the Text REtrieval Conferences (TREC) organised by the US National Institute of Standards and Technology (NIST). The approach adopted and the issues that had to be taken into consideration when building a multilingual test suite and developing appropriate evaluation procedures to test cross-language systems are described. From 2000 on, a cross-language evaluation activity for European languages known as CLEF (Cross-Language Evaluation Forum) will be coordinated in Europe, while TREC will focus on Asian languages. The implications of the move to Europe and the intentions for the future are discussed. |
71 | Spoken Portuguese: Geographic and Social Varieties | The main goal of the Spoken Portuguese: Geographic and Social Varieties project is the teaching of Portuguese as a foreign language. The idea is to provide a collection of authentic spoken texts and to make it easy to use. Therefore, a selection of spontaneous oral data was made, using either already compiled material or material recorded for this purpose. The final corpus constitutes a representative sample that includes European, Brazilian and African Portuguese, as well as Macau and East Timor Portuguese. In order to produce a functional product, the Linguistics Center of Lisbon University developed sound/text alignment software. The final result is a CD-ROM collection that contains 83 text files, 83 sound files and 83 files produced by the sound/text alignment tool. The independence between sound and text files allows CD-ROM users to use the material for purposes other than the educational one. |
72 | Portuguese Corpora at CLUL | The Corpus de Referência do Português Contemporâneo (CRPC) has been under development at the Centro de Linguística da Universidade de Lisboa (CLUL) since 1988, with the aim of enlarging the research data available for verifying concepts and hypotheses and rejecting the sole use of intuitive data. The intention behind creating this open corpus is to establish an on-line representative sample collection of general-usage contemporary Portuguese: a main corpus of great dimension as well as several specialized corpora. The CRPC currently contains around 92 million words. Following common practice in this area, the CRPC project intends to establish a linguistic database accessible to everyone interested in carrying out theoretical and practical studies or applications. The dialectal oral corpus of the Atlas Linguístico-Etnográfico de Portugal e da Galiza (ALEPG) consists of approximately 3,500 hours of speech collected by the CLUL Dialectal Studies Research Group and recorded on analogue audio tape. This corpus contains mainly directed speech: answers to a mainly lexical linguistic questionnaire that also covers some phonetic and morpho-phonological phenomena. A substantial portion of spontaneous speech also enables other kinds of studies, such as syntactic, morphological or phonetic ones. |
74 | Reusing the Mikrokosmos Ontology for Concept-based Multilingual Terminology Databases | This paper reports work carried out within a multilingual terminology project (OncoTerm) in which the Mikrokosmos (µK) ontology (Mahesh, 1996; Viegas et al., 1999) has been used as a language-independent conceptual structure to achieve a truly concept-based terminology database (termbase, for short). The original ontology, containing nearly 4,700 concepts and available in Lisp-like format (January 1997 version), was first converted into a set of tables in a relational database. A specific software tool was developed in order to edit and browse this resource. This tool has now been integrated within a termbase editor and released under the name of OntoTerm™. In this paper we focus on the suitability of the µK ontology for the representation of domain-specific knowledge and its associated lexical items. |
75 | Abstraction of the EDR Concept Classification and its Effectiveness in Word Sense Disambiguation | The relation between the degree of abstraction of a concept and the explanation capability (validity and coverage) of the conceptual description, i.e. the constraint that holds between concepts, is clarified experimentally by performing an operation called concept abstraction. This is the procedure of choosing a certain set of lower-level concepts in a concept hierarchy and mapping the set to one or more upper-level (abstract) concepts. We adopted three abstraction techniques for the degree of abstraction: a flat depth, a flat size, and a flat probability method. Taking these methods and degrees as parameters, we applied concept abstraction to the EDR Concept Classifications and performed a word sense disambiguation test. The test set and the disambiguation knowledge were extracted as co-occurrence expressions from the EDR Corpora. Through the test, we found that the flat probability method gives the best result. We also carried out an evaluation comparing the abstracted hierarchy with that of human introspection and found that the flat size method gives results most similar to the human one. These results should help clarify the appropriate level of detail of a concept hierarchy for a given application purpose. |
76 | Will Very Large Corpora Play For Semantic Disambiguation The Role That Massive Computing Power Is Playing For Other AI-Hard Problems? | In this paper we formally analyze the relation between the amount of (possibly noisy) examples provided to a word-sense classification algorithm and the performance of the classifier. In the first part of the paper, we show that Computational Learning Theory provides a suitable theoretical framework to establish one such relation. In the second part of the paper, we will apply our theoretical results to the case of a semantic disambiguation algorithm based on syntactic similarity. |
77 | Guidelines for Japanese Speech Synthesizer Evaluation | Speech synthesis technology is one of the most important elements required for better human interfaces for communication and information systems. This paper describes the ''Guidelines for Speech Synthesis System Performance Evaluation Methods'' created by the Speech Input/Output Systems Expert Committee of the Japan Electronic Industry Development Association (JEIDA). JEIDA has been investigating speech synthesizer evaluation methods since 1993 and previously reported a provisional version of the guidelines. The guidelines comprise six chapters: General rules, Text analysis evaluation, Syllable articulation test, Word intelligibility test, Sentence intelligibility test, and Overall quality evaluation. |
78 | Constructing a Tagged E-J Parallel Corpus for Assisting Japanese Software Engineers in Writing English Abstracts | This paper presents how we constructed a tagged E-J parallel corpus of sample abstracts, which is the core language resource for our English abstract writing tool, the “Abstract Helper.” This writing tool is aimed at helping Japanese software engineers be more productive in writing by providing them with good models of English abstracts. We collected 539 English abstracts from technical journals/proceedings and prepared their Japanese translations. After analyzing the rhetorical structure of these sample abstracts, we tagged each sample abstract with both an abstract type and an organizational-scheme type. We also tagged each sample sentence with a sentence role and one or more verb complementation patterns. We also show that our tagged E-J parallel corpus of sample abstracts can be effectively used for providing users with both discourse-level guidance and sentence-level assistance. Finally, we discuss the outlook for further development of the “Abstract Helper.” |
79 | Extraction of Unknown Words Using the Probability of Accepting the Kanji Character Sequence as One Word | In this paper, we propose a method to extract unknown words composed of two or three kanji characters from Japanese text. Generally, an unknown word composed of kanji characters is segmented into other words by morphological analysis, and the appearance probability of each segmented word is small. Based on these features, we can define a measure of accepting a two- or three-kanji character sequence as an unknown word. In addition, we can identify some segmentation patterns typical of unknown words. By applying our measure to kanji character sequences that show these patterns, we can extract unknown words. In the experiment, the F-measure for the extraction of unknown words composed of two and three kanji characters was about 0.7 and 0.4 respectively. Our method does not need the frequency of a word in the training corpus to judge whether it is an unknown word or not, and therefore has the advantage that low-frequency unknown words can also be extracted. |
80 | Automatic Speech Segmentation in High Noise Condition | Accurate segmentation of speech and end-point detection in adverse conditions is very important for building robust automatic speech recognition (ASR) systems. Segmentation of speech is not a trivial process: in high noise conditions it is very difficult to detect weak fricatives and nasals at the ends of words. An efficient speech segmentation algorithm that is independent of a priori defined thresholds and robust to the level of disturbance signals has been developed. The results show a significant improvement in the robustness of the proposed algorithm with respect to traditional algorithms. |
81 | Open Ended Computerized Overview of Controlled Languages | Controlled languages (CLs) are of undoubted interest to industry (for safety and economic reasons, among others), and those wishing to create a CL need to be aware of what has already been done. We have therefore built an open-ended computerized overview that gives instant access to this information. To achieve it, we looked closely at what has been written in the field of CLs and tried to get in touch with the people involved in different projects (K. Barthe, E. Johnson, K. Godden, B. Arendse, E. Adolphson, T. Hartley, etc.). |
82 | Shallow Parsing and Functional Structure in Italian Corpora | In this paper we argue in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000-word corpus of Italian. Most papers present approaches to tagging which are statistically based, yet none of the statistically based analyses produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Of course their data refer strictly to English, with the exception of [2, 3, 4]. As for Italian, we argue that purely statistically based approaches are inefficient, basically due to great sparsity of tag distribution: 50% or less of unambiguous tags when punctuation is subtracted from the total count. In addition, the level of homography is also very high: 1.7 readings per word, compared to 1.07 computed for English by [2] with a similar tagset. The current work includes a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora. In a preliminary experiment with the automatic tagger, we obtained 99.97% accuracy on the training set and 99.03% on the test set using combined approaches; the accuracy of statistical tagging alone is well below 95% even on the training set, and the same applies to syntactic tagging. As to the shallow parser and GF-assigner, we shall report on a first preliminary experiment on a manually verified subset of 10,000 words. |
84 | Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon | During recent years the development of high-quality lexical resources for real-world Natural Language Processing (NLP) applications has gained a lot of attention from many research groups around the world, and from the European Union through its promotion of language engineering projects dealing directly or indirectly with this topic. In this paper, we focus on ways to extend and enrich such a resource, namely the Swedish version of the SIMPLE lexicon, in an automatic manner. The SIMPLE project (Semantic Information for Multifunctional Plurilingual Lexica) aims at developing wide-coverage semantic lexicons for 12 European languages, though on a rather small scale for practical NLP, namely fewer than 10,000 entries. Consequently, our intention is to explore and exploit various (inexpensive) methods to progressively enrich the resources and, subsequently, to annotate texts with the semantic information encoded within the framework of SIMPLE, enhanced with the semantic data from the Gothenburg Lexical DataBase (GLDB) and from large corpora. |
85 | Providing Internet Access to Portuguese Corpora: the AC/DC Project | In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do portugues) as concerns providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilizacao de Corpora, roughly ''Access and Availability of Corpora''), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still subject to regular improvement, we focus on the process of tagging and parsing the underlying corpora, using a Constraint Grammar parser for Portuguese. |
86 | Turkish Electronic Living Lexicon (TELL): A Lexical Database | The purpose of the TELL project is to create a database of Turkish lexical items which reflects actual speaker knowledge, rather than the normative and phonologically incomplete dictionary representations on which most of the existing phonological literature on Turkish is based. The database, accessible over the internet, should greatly enhance phonological, morphological, and lexical research on the language. The current version of TELL consists of the following components: • Some 15,000 headwords from the 2nd and 3rd editions of the Oxford Turkish-English dictionary, orthographically represented. • Proper names, including 175 place names from a guide to Istanbul, and 5,000 place names from a telephone area code directory of Turkey. • Phonemic transcriptions of the pronunciations of the same headwords and place names embedded in various morphological contexts. (Eliciting suffixed forms along with stems exposes any morphophonemic alternations that the headwords in question are subject to.) • Etymological information, garnered from a variety of etymological sources. • Roots for a number of morphologically complex headwords. The paper describes the construction of the current structure of the TELL database, points out potential questions that could be addressed by putting the database into use, and specifies goals for the next phase of the project. |
87 | Orthographic Transcription of the Spoken Dutch Corpus | This paper focuses on the specification of the orthographic transcription task in the Spoken Dutch Corpus, the problems encountered in making that specification and the evaluation experiments that were carried out to assess the transcription efficiency and the inter-transcriber consistency. It is stated that the role of the orthographic transcriptions in the Spoken Dutch Corpus is twofold: on the one hand, the transcriptions are important for future database users, on the other hand they are indispensable to the development of the corpus itself. The main objectives of the transcription task are the following: (1) to obtain a verbatim transcription that can be made with a minimum level of interpretation of the utterances; (2) to obtain an alignment of the transcription to the speech signal on the level of relatively short chunks; (3) to obtain a transcription that is useful to researchers working in several research areas and (4) to adhere to international standards for existing large speech corpora. In designing the transcription protocol and transcription procedure it was attempted to establish the best compromise between consistency, accuracy and usability of the output and efficiency of the transcription task. For example, the transcription procedure always consists of a first transcription cycle and a verification cycle. Some efficiency and consistency statistics derived from pilot experiments with several students transcribing the same material are presented at the end of the paper. In these experiments the transcribers were also asked to record the amount of time they spent on the different audio files, and to report difficulties they encountered in performing their task. |
90 | Development of Acoustic and Linguistic Resources for Research and Evaluation in Interactive Vocal Information Servers | This paper describes the setting up of a resource database for research and evaluation in the domain of interactive vocal information servers. This resource development work took place in a research project aiming at the development of an advanced speech recognition system for the automatic processing of telephone directory requests, and was performed on the basis of the Swiss-French Polyphone database (collected in the framework of the European SpeechDat project). Due to the unavailability for the targeted area of a properly orthographically transcribed, consistently labeled and tagged database of unconstrained speech (together with its associated lexicon), we first concentrated on the annotation and structuring of the spoken request data in order to make it usable for lexical and linguistic modeling and for the evaluation of recognition results. A baseline speech recognition system was then trained on the newly developed resources and tested. Preliminary recognition experiments showed a relative improvement of 46% in Word Error Rate (WER) compared to the results previously obtained with a very similar baseline system working on the inconsistent natural speech database that was originally available. |
91 | An Architecture for Document Routing in Spanish: Two Language Components, Pre-processor and Parser | This paper describes the language components of a system for Document Routing in Spanish. The system identifies terms relevant for classification within the documents involved by means of natural language processing techniques. These techniques are based on the isolation and normalization of syntactic units considered relevant for the classification, especially noun phrases, but also other constituents built around verbs, adverbs, pronouns or adjectives. After a general introduction to the research project, the second Section relates our approach to previous and current approaches to the problem, and the third describes the corpora used for evaluating the system. The linguistic analysis architecture, including pre-processing and two different levels of syntactic analysis, is described in the fourth and fifth Sections, while the last Section is dedicated to a comparative analysis of the results obtained from processing the corpora introduced in the third Section. Certain future developments of the system are also included in this last Section. |
92 | Target Suites for Evaluating the Coverage of Text Generators | Our goal is to evaluate the grammatical coverage of the surface realization component of a natural language generation system by means of target suites. We consider the utility of re-using for this purpose test suites designed to assess the coverage of natural language analysis / understanding systems. We find that they are of some interest, in helping inter-system comparisons and in providing an essential link to annotated corpora. But they have limitations. First, they contain a high proportion of ill-formed items which are inappropriate as targets for generation. Second, they omit phenomena such as discourse markers which are key issues in text production. We illustrate a partial remedy for this situation in the form of a text generator that annotates its own output to an externally specified standard, the TSNLP scheme. |
93 | LT TTT - A Flexible Tokenisation Tool | We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up and the preparation |
94 | Perception and Analysis of a Reiterant Speech Paradigm: a Functional Diagnostic of Synthetic Prosody | A set of perception experiments, using reiterant speech, were designed to carry out a diagnostic of the segmentation/hierarchisation linguistic function of prosody. The prosodic parameters of F0, syllabic duration and intensity of the stimuli used during this experiment were extracted. Several dissimilarity measures (correlation, root-mean-square distance and mutual information) were used to match the results of the subjective experiment. This comparison of the listeners' perception with acoustic parameters is intended to underline the acoustic keys used by listeners to judge the adequacy of prosody to perform a given linguistic function. |
95 | Development and Evaluation of an Italian Broadcast News Corpus | This paper reports on the development and evaluation of an Italian broadcast news corpus at ITC-irst, under a contract with the European Language Resources Distribution Agency (ELDA). The corpus consists of 30 hours of recordings transcribed and annotated with conventions similar to those adopted by the Linguistic Data Consortium for the DARPA HUB-4 corpora. The corpus will be completed and released to ELDA by April 2000. |
96 | Multilingual Linguistic Resources: From Monolingual Lexicons to Bilingual Interrelated Lexicons | This paper describes a procedure to convert the PAROLE-SIMPLE monolingual lexicons into bilingual interrelated lexicons where each word sense of a given language is linked to the pertinent senses of the right words in one or more target lexicons. At present, the SIMPLE lexicons are monolingual, although the ultimate goal of these harmonised monolingual lexicons is to build multilingual lexical resources. To achieve this goal it is necessary to automatise the linking among the different senses of the different monolingual lexicons, as producing such multilingual relations by hand would be, like all tasks related to the development of linguistic resources, unaffordable in terms of human resources and time. The system we describe in this paper takes advantage of the SIMPLE model and the SIMPLE-based lexicons so that, in the best case, it can find fully automatically the relevant sense-to-sense correspondences for determining the translational equivalence of two words in two different languages and, in the worst case, it can narrow the set of admissible links between words and relevant senses. This paper also explores to what extent semantic encoding in existing computational lexicons such as SIMPLE can help in overcoming the problems that arise when using monolingual meaning descriptions for bilingual links, and aims to set the basis for defining a model for adding a bilingual layer to the SIMPLE model. This bilingual layer, based on a bilingual relation model, will in turn be the basis for defining the multilingual language resource we want the PAROLE-SIMPLE lexicons to become. |
98 | Where Opposites Meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation | The paper describes the use of FAME, a functional annotation meta-scheme for the comparison and evaluation of syntactic annotation schemes, i) as a flexible yardstick in multi-lingual and multi-modal parser evaluation campaigns and ii) for corpus annotation. We show that FAME complies with a variety of non-trivial methodological requirements, and has the potential for being effectively used as an “interlingua” between different syntactic representation formats. |
99 | Controlled Bootstrapping of Lexico-semantic Classes as a Bridge between Paradigmatic and Syntagmatic Knowledge: Methodology and Evaluation | Semantic classification of words is a highly context-sensitive and somewhat moving target, hard to deal with and even harder to evaluate on an objective basis. In this paper we suggest a step-wise methodology for the automatic acquisition of lexico-semantic classes and delve into the non-trivial issue of how results should be evaluated against a top-down reference standard. |
100 | Coreference Annotation: Whither? | The terms coreference and anaphora tend to be used inconsistently and interchangeably in much empirically-oriented work in NLP, and this threatens to lead to incoherent analyses of texts and arbitrary loss of information. This paper discusses the role of coreference annotation in Information Extraction, focussing on the coreference scheme defined for the MUC-7 evaluation exercise. We point out deficiencies in that scheme and make some suggestions towards a new annotation philosophy. |