LREC 2000 2nd International Conference on Language Resources & Evaluation | |
Conference Papers and Abstracts
The Cost258 Signal Generation Test Array | This paper describes a benchmark for Analysis-Modification-Synthesis Systems (AMSS), which form the back-end of all concatenative speech synthesis systems. After introducing the motivations and principles underlying this initiative, we present a first anonymous objective evaluation comparing the performance of 5 such AMSS. | |
Collocations as Word Co-occurrence Restriction Data - An Application to Japanese Word Processor - | Collocations, the combination of specific words, are quite useful linguistic resources for NLP in general. The purpose of this paper is to show their usefulness, exemplifying an application to Kanji character decision processes for Japanese word processors. Unlike recent trials of automatic extraction, our collocations were collected manually through many years of intensive investigation of corpora. Our collection procedure consists of (1) finding a proper combination of words in a corpus and (2) recollecting similar combinations of words, incited by it. This procedure, which depends on human judgment and the enrichment of data by association, is effective for remedying the data sparseness problem, although the arbitrariness of human judgment is inevitable. Approximately seventy-two thousand four hundred collocations were used as word co-occurrence restriction data for deciding Kanji characters in Japanese word processors. Experiments have shown that the collocation data yield 8.9% higher Kana-to-Kanji character conversion accuracy than a system which uses no collocation data, and 7.0% higher than a commercial word processor of average performance. | |
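As a concrete illustration of how such collocation data can act as a co-occurrence restriction in Kana-to-Kanji conversion, the sketch below re-ranks conversion candidates by whether they form a known collocation with a neighbouring word. The toy collocation set and the scoring are our own illustration, not the system described in the abstract above.

```python
# Minimal sketch (illustrative only): prefer Kanji conversion candidates
# that form a known collocation with a neighbouring context word.
collocations = {("電車", "乗る"), ("記事", "載る")}  # toy collocation data

def rank_candidates(context_word, candidates):
    """Order candidates so those forming a collocation with the context come first."""
    return sorted(candidates, key=lambda c: (context_word, c) not in collocations)

# "のる" is ambiguous between 乗る (ride) and 載る (appear in print);
# the collocation with 電車 (train) selects 乗る.
print(rank_candidates("電車", ["載る", "乗る"]))  # ['乗る', '載る']
```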
Enhancing the TDT Tracking Evaluation | Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative concerned with finding groups of stories on the same topic (tdt, 1998). The goal is to build systems that can segment, detect, and track incoming news stories (possibly from multiple continuous feeds) with respect to pre-defined topics. While the detection task detects the first story on a particular topic, the tracking task determines, for each story, which topic it is relevant to. This paper will discuss the algorithm currently used for evaluating systems for the tracking task, present some of its limitations, and propose a new algorithm that enhances the current evaluation. | |
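For orientation, the sketch below shows the kind of miss/false-alarm scoring that tracking evaluations of this type are built on; the cost weights and the helper function are illustrative assumptions, not the algorithm analysed in the paper above.

```python
# Illustrative sketch: score per-story tracking decisions against the truth.
def tracking_scores(decisions, truth, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """decisions/truth map story id -> bool (system says / really is on-topic)."""
    targets = [s for s in truth if truth[s]]
    nontargets = [s for s in truth if not truth[s]]
    p_miss = sum(not decisions[s] for s in targets) / max(len(targets), 1)
    p_fa = sum(decisions[s] for s in nontargets) / max(len(nontargets), 1)
    # Weighted detection cost combining the two error rates.
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return p_miss, p_fa, cost
```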
GREEK ToBI: A System for the Annotation of Greek Speech Corpora | Greek ToBI is a system for the annotation of (Standard) Greek spoken corpora, that encodes intonational, prosodic and phonetic information. It is used to develop a large and publicly available database of prosodically annotated utterances for research, engineering and educational purposes. Greek ToBI is based on the system developed for American English (ToBI), but includes novel features (“tiers”) designed to address particularities of Greek prosody that merit annotation, such as stress and juncture. Thus Greek ToBI includes five tiers: the Tone Tier shows the intonational analysis of the utterance; the Prosodic Words Tier is a phonetic transcription; the Break Index Tier shows indices of cohesion; the Words Tier gives the text in romanization; the Miscellaneous Tier is used to encode other relevant information (e.g., disfluency or pitch-halving). The development of GRToBI is largely based on the transcription and analysis of a corpus of spoken Greek, that includes data from several speakers and speech styles, but also draws on existing quantitative research on Greek prosody. | |
English Senseval: Report and Results | There are now many computer programs for automatically determining which sense a word is being used in. One would like to be able to say which were better, which worse, and also which words, or varieties of language, presented particular problems to which programs. In 1998 a first evaluation exercise, SENSEVAL, took place. The English component of the exercise is described, and results presented. | |
SALA: SpeechDat across Latin America. Results of the First Phase | The objective of the SALA (SpeechDat across Latin America) project is to record large SpeechDat-like databases to train telephone speech recognisers for any country in Latin America. The SALA consortium is composed of several European companies (CSELT, Italy; Lernout & Hauspie, Belgium; Philips, Germany; Siemens AG, Germany; Vocalis, U.K.) and universities (UPC, Spain; SPEX, The Netherlands). This paper gives an overview of the project, introduces the definition of the databases, shows the dialectal distribution in the countries where recordings take place and gives information about validation issues, actual status and practical experiences in recruiting and annotating such large databases in Latin America. | |
Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging | The paper presents one way of reconciling data sparseness with the requirement of high accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements and therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1000 morpho-syntactic description codes (MSDs) which after considering some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require unrealistically large training data (hand annotated/validated). Our solution was to design a hidden reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as LMs are available. The tag differences between these variants are processed by a combiner which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags from the large tagset. We describe this processing chain and provide a detailed evaluation of the results. | |
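A minimal sketch of the final recovery step described above, mapping tags from the hidden reduced tagset back onto full MSDs with the help of the lexicon; the data structures are hypothetical and the language models and combiner are not shown.

```python
# Illustrative sketch: recover full MSDs from reduced tags via the lexicon.
def recover_msds(tagged_words, lexicon, reduce_tag):
    """tagged_words: list of (word, reduced_tag); lexicon: word -> set of MSDs;
    reduce_tag: function mapping a full MSD to its reduced tag."""
    recovered = []
    for word, reduced in tagged_words:
        # Keep only lexicon MSDs whose reduction matches the chosen reduced tag.
        candidates = [m for m in lexicon.get(word, []) if reduce_tag(m) == reduced]
        # For most words this is unambiguous; otherwise further disambiguation
        # (not shown) would be required.
        recovered.append((word, candidates[0] if candidates else reduced))
    return recovered
```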
TransSearch: A Free Translation Memory on the World Wide Web | A translation memory is an archive of existing translations, structured in such a way as to promote translation re-use. Under this broad definition, an interactive bilingual concordancing tool like the RALI’s TransSearch system certainly qualifies as a translation memory. This paper describes the Web-based version of TransSearch, which, for the last three years, has given Internet users access to a large English-French translation database made up of Canadian parliamentary debates. Despite the fact that the RALI has done very little to publicize the availability of TransSearch on the Web, the system has been attracting a growing and impressive number of users. We present some basic data on who is using TransSearch and how, data which was collected from the system’s log file and by means of a questionnaire recently added to our Web site. We conclude with a call to the international community to help set up a network of bi-textual databases like TransSearch, which translators around the world could freely access over the Web. | |
Semantic Encoding of Danish Verbs in SIMPLE - Adapting a Verb Framed Model to a Satellite-framed Language | In this paper we give an account of the representation of Danish verbs in the semantic lexicon model, SIMPLE. Danish is a satellite-framed language where prepositions and adverbial particles express what in many other languages form part of the meaning of the verb stem. This aspect of Danish - as well as of the other Scandinavian languages - challenges the borderlines of a universal, strictly modular framework which centralises around the governing word classes and their arguments. In particular, we look into the representation of phrasal verbs and we propose a classification into compositional and non-compositional phrasal verbs, respectively, and adopt a so-called split late strategy where non-compositional phrasal verbs are identified only at the semantic level of analysis. | |
A Comparison of Summarization Methods Based on Task-based Evaluation | A task-based evaluation scheme has been adopted as a new method of evaluation for automatic text summarization systems. It evaluates the performance of a summarization system in a given task, such as information retrieval and text categorization. This paper compares ten different summarization methods based on information retrieval tasks. In order to evaluate the system performance, the subjects’ speed and accuracy are measured in judging the relevance of texts using summaries. We also analyze the similarity of summaries in order to investigate the similarity of the methods. Furthermore, we analyze what factors can affect evaluation results, and describe the problems that arose from our experimental design, in order to establish a better evaluation scheme. | |
A Word Sense Disambiguation Method Using Bilingual Corpus | This paper proposes a word sense disambiguation (WSD) method using a bilingual corpus in an English-Chinese machine translation system. A mathematical model is constructed to disambiguate words in terms of context phrasal collocation. A rule-learning algorithm is proposed, and an application algorithm for the learned rules is also provided, which can increase the recall ratio. Finally, an analysis based on an experiment with the algorithm is given. Its application gives an increase of 10% in precision. | |
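A hedged sketch of the general idea of sense selection from learned collocation rules; the toy rule table and Chinese translations are our own illustration, not the paper's model or rule format.

```python
# Illustrative sketch: choose the translation of an ambiguous English word
# from collocation rules (context word -> preferred Chinese equivalent).
rules = {("interest", "rate"): "利率",   # financial sense
         ("interest", "show"): "兴趣"}   # 'attention' sense

def disambiguate(word, context_words, default=None):
    for ctx in context_words:
        if (word, ctx) in rules:
            return rules[(word, ctx)]
    return default

print(disambiguate("interest", ["rate", "bank"]))  # 利率
```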
Perceptual Evaluation of a New Subband Low Bit Rate Speech Compression System based on Waveform Vector Quantization and SVD Postfiltering | This paper proposes a new low rate speech coding algorithm, based on a subband approach. At first, a frame of the incoming signal is fed to a low pass filter, thus yielding the low frequency (LF) part. By subtracting the latter from the incoming signal the high frequency (HF), non-smoothed part is obtained. The HF part is modeled using waveform vector quantisation (VQ), while the LF part is modeled using a spectral estimation method based on a Hankel matrix, its shift invariant property and SVD, called CSE. At the receiver side an adaptive postfiltering based on SVD is performed for the HF part, a simple resynthesis for the LF part, before the two components are added in order to produce the reconstructed signal. Progressive speech compression (variable degree of analysis/synthesis at transmitter/receiver) is thus possible resulting in a variable bit rate scheme. The new method is compared to the CELP algorithm at 4800 bps and is proven of similar quality, in terms of intelligibility and segmental SNR. Moreover, perceptual evaluation tests of the new method were conducted for different bit rates up to 1200 bps and the majority of the evaluators indicated that the technique provides intelligible reconstruction. | |
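The subband split described above (low-pass filtering a frame and subtracting to obtain the high-frequency residual) can be sketched as follows; the sampling rate, cutoff and filter order are arbitrary illustrative choices, and the VQ and SVD stages are not shown.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_frame(frame, fs=8000, cutoff=1000, order=4):
    """Split a speech frame into a smoothed LF part and the HF residual."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    lf = filtfilt(b, a, frame)   # LF part (modelled with the SVD-based method)
    hf = frame - lf              # non-smoothed HF part (modelled with waveform VQ)
    return lf, hf

# Toy usage on a synthetic 20 ms frame.
t = np.arange(0, 0.02, 1 / 8000)
frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
lf, hf = split_frame(frame)
```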
Terms Specification and Extraction within a Linguistic-based Intranet Service | This paper describes the adaptation and extension of an existing morphological system, Word Manager, and its integration into an intranet service of a large international bank. The system includes a tool for the analysis and extraction of simple and complex terms. As a side-effect, the procedure for the definition of new terms has been consolidated. The intranet service analyzes HTML pages on the fly, compares the results with the vocabulary of an inhouse terminological database (CLS-TDB) and generates hyperlinks in case matches have been found. Currently, the service handles terms in both German and English. The implementation of the service for Italian, French and Spanish is under way. | |
Semantico-syntactic Tagging of Very Large Corpora: the Case of Restoration of Nodes on the Underlying Level | The Prague Dependency Treebank has been conceived of as a semi-automatic three-layer annotation system, in which the layers of morphemic and 'analytic' (surface-syntactic) tagging are followed by the layer of tectogrammatical tree structures. Two types of deletions are recognized: (i) those licensed by the grammatical properties of the given sentence, and (ii) those possible only if the preceding context exhibits certain specific properties. Within group (i), either the position itself in the sentence structure is determined, but its lexical setting is 'free' (as e.g. with a deleted subject in Czech as a pro-drop language), or both the position and its 'filler' are determined. Group (ii) reflects the typological differences between English and Czech; the rich morphemics of the latter is more favorable for deletions. Several steps of the tagging procedure are carried out automatically, but most parts of the restoration of deleted nodes still have to be done ''manually''. If, along with the node that is being restored, nodes depending on it are also deleted, then these are restored only if they function as arguments or obligatory adjuncts. The large set of annotated utterances will make it possible to check and amend the present results, also with applications of statistical methods. Theoretical linguistics will be enabled to check its descriptive framework; the degree of automation of the procedure will then be raised, and the treebank will be useful for many different tasks in language processing. | |
Coreference in Annotating a Large Corpus | The Prague Dependency Treebank (PDT) is a part of the Czech National Corpus, annotated with disambiguated structural descriptions representing the meaning of every sentence in its environment. To achieve that aim, it is necessary i.a. to make explicit (at least some basic) coreferential relations within the sentence boundaries and also beyond them. The PDT scenario includes both automatic and 'manual' procedures; among the former type, there is one that concerns coreference, indicating the lemma of the subject in a specific attribute of the label belonging to a node for a reflexive pronoun, and assigning the deleted nodes in coordinated constructions the lemmas of their counterparts in the given construction. 'Manual' operations restore nodes for the deleted items mostly as pronouns. The distinction between grammatical and textual coreference is reflected. In order to get a possibility of handling textual coreference, specific attributes reflect the linking of sentences to each other and to the context of situation, and the development of the degrees of activation of the 'stock of shared knowledge' will be registered in so far as they are derivable from the use of nouns in subsequent utterances in a discourse. | |
Designing a Tool for Exploiting Bilingual Comparable Corpora | Translators have a real need for a tool that will allow them to exploit information contained in bilingual comparable corpora. ExTrECC is designed to be a semi-automatic tool that processes bilingual comparable corpora and presents a translator with a list of potential equivalents (in context) of the search term. The task of identifying translation equivalents in a non-aligned, non-translated corpus is a difficult one, and ExTrECC makes use of a number of techniques, some of which are simple and others more sophisticated. The basic design of ExTrECC (graphical user interface, architecture, algorithms) is outlined in this paper. | |
Creating and Using Domain-specific Ontologies for Terminological Applications | Huge volumes of scientific databases and text collections are constantly becoming available, but their usefulness is at present hampered by their lack of uniformity and structure. There is therefore an overwhelming need for tools to facilitate the processing and discovery of technical terminology, in order to make processing of these resources more efficient. Both NLP and statistical techniques can provide such tools, but they would benefit greatly from the availability of suitable lexical resources. While information resources do exist in some areas of terminology, these are not designed for linguistic use. In this paper, we investigate how one such resource, the UMLS, is used for terminological acquisition in the TRUCKS system, and how other domain-specific resources might be adapted or created for terminological applications. | |
The TREC-8 Question Answering Track | The TREC-8 Question Answering track was the first large-scale evaluation of domain-independent question answering systems. This paper summarizes the results of the track, including both an overview of the approaches taken to the problem and an analysis of the evaluation methodology. Retrieval results for the more stringent condition in which system responses were limited to 50 bytes showed that explicit linguistic processing was more effective than the bag-of-words approaches that are effective for document retrieval. The use of multiple human assessors to judge the correctness of the systems' responses demonstrated that assessors have legitimate differences of opinion as to correctness even for fact-based, short-answer questions. Evaluations of question answering technology will need to accommodate these differences since eventual end-users of the technology will have similar differences. | |
IREX: IR & IE Evaluation Project in Japanese | We will report on the IREX (Information Retrieval and Extraction Exercise) project. It is an evaluation-based project for Information Retrieval and Information Extraction in Japanese. The project started in May 1998 and concluded in September 1999 with the IREX workshop held in Tokyo with more than 150 attendees (IREX Committee, 1999). There is a homepage of the project at (IREX, Homepage) and anyone can download almost all the data and the tools produced by the project for free. | |
Towards A Universal Tool For NLP Resource Acquisition | This paper describes an approach to developing a universal tool for eliciting, from a non-expert human user, knowledge about any language L. The purpose of this elicitation is rapid development of NLP systems. The approach is described using the example of the syntax module of the Boas knowledge elicitation system for a quick ramp up of a standard transfer-based machine translation system from L into English. The preparation of knowledge for the MT system is carried out in two stages: the acquisition of descriptive knowledge about L, and the use of the descriptive knowledge to derive operational knowledge for the system. Boas guides the acquisition process using data-driven, expectation-driven and goal-driven methodologies. | |
The Multi-layer Language Knowledge Base of Chinese NLP | This paper introduces the effort to build a multi-layer language knowledge base for Chinese NLP which combines list-based, rule-based and corpus-based language information. Different kinds of information are designed to solve the different kinds of problems encountered in Chinese NLP. The whole knowledge base is designed with theoretical consistency and can easily be put into practice in application systems. | |
With WORLDTREK Family, Create, Update and Browse your Terminological World | Companies need to extract pertinent and coherent information from large collections of documents to be competitive and efficient. Structured terminologies are essential for better drafting, translation or understanding of technical communication. WORLDTREK EDITION is a tool created to help the terminologist elaborate, browse and update structured terminologies in an ergonomic environment without changing his or her working method. This application can be entirely adapted to the « terminological habits » of the expert. Thus, the data loaded in the software is meta-data. Links, status, property names and domains can be customized. Moreover, the validation stage is facilitated by the use of templates, queries and filters. New terms and links can be easily created to enrich the domains and points of view. Properties such as definition, context, and equivalents in foreign languages are associated with the terms. WORLDTREK EDITION facilitates the comparison and merging of pre-existing networks. All these tasks and the visualization techniques constitute the tool which will help the terminologist to be more effective and productive. | |
Etude et Evaluation de la Di-Syllabe comme Unite Acoustique pour le Systeme de Synthese Arabe PARADIS | The study presented in this article is part of the development of a text-to-speech synthesis system for Arabic. Our system, PARADIS, is based on the concatenation of di-syllables, with TD-PSOLA as the synthesis technique. In this article we present the interest of choosing the di-syllable as the concatenation unit for the synthesizer and its contribution to synthesis quality. Indeed, the di-syllable substantially improves synthesis quality and reduces problems of temporal discontinuity at concatenation points. However, we are confronted with several problems caused by the considerable size of the di-syllable set and by its adaptation to prosodic models that are usually associated with the syllable as the rhythmic unit. We therefore describe the principle on which we based the reduction of the number of di-syllables. We then present the procedure we developed for the automatic generation and labelling of the di-syllable dictionary. In particular, we chose logatomes whose forms are especially well suited to automating both the generation of the logatome corpus and the automatic segmentation operation. Finally, we present a technique for organizing the acoustic dictionary that is perfectly adapted to the form of the Arabic di-syllable. | |
Dialogue Annotation for Language Systems Evaluation | The evaluation of Natural Language Processing (NLP) systems is still an open problem demanding further research progress from the research community to establish general evaluation frameworks. In this paper we present an experimental multilevel annotation process to be followed during the testing phase of Spoken Language Dialogue Systems (SLDSs). Based on this process we address some issues related to an annotation scheme of evaluation dialogue corpora and particular annotation tools and processes. | |
Evaluation of TRANSTYPE, a Computer-aided Translation Typing System: A Comparison of a Theoretical- and a User-oriented Evaluation Procedures | We describe and compare two protocols, one theoretical and the other in situ, for evaluating the TRANSTYPE system, a target-text mediated interactive machine translation prototype which predicts in real time the words of the ongoing translation. | |
Extraction of Semantic Clusters for Terminological Information Retrieval from MRDs | This paper describes a semantic clustering method for data extracted from machine readable dictionaries (MRDs) in order to build a terminological information retrieval system that finds terms from descriptions of concepts. We first examine approaches based on ontologies and statistics, before introducing our analogy-based approach that lets us extract semantic clusters by aligning definitions from two dictionaries. Evaluation of the final set of clusters for a small set of definitions demonstrates the utility of our approach. | |
Obtaining Predictive Results with an Objective Evaluation of Spoken Dialogue Systems: Experiments with the DCR Assessment Paradigm | The DCR methodology is a framework that proposes a generic and detailed evaluation of spoken dialog systems. We have already detailed (Antoine et al., 1998) the theoretical bases of this paradigm. In this paper, we present some experimental results on spoken language understanding that show the feasibility and the reliability of the DCR evaluation as well as its ability to provide a detailed diagnosis of the system’s behaviour. Finally, we highlight the extension of the DCR methodology to dialogue management. | |
MHATLex: Lexical Resources for Modelling the French Pronunciation | The aim of this paper is to introduce the lexical resources and environment, called MHATLex, intended for speech and text processing. Particular attention is paid to pronunciation modelling, which can be used in automatic speech processing as well as in phonological/phonetic description of languages. In our paper we will introduce a pronunciation model, the MHAT model (Markovian Harmonic Adaptation and Transduction), which copes with free and context-dependent variants. At the same time, we will present the MHATLex resources. They include 500,000 inflected forms and tools allowing the generation of various lexicons through phonological tables. Finally, some illustrations of the use of MHATLex in ASR will be shown. | |
Dialogue and Prompting Strategies Evaluation in the DEMON System | In order to improve the usability and efficiency of dialogue systems, a major issue is better adapting dialogue systems to their intended users. This requires a good knowledge of users' behaviour when interacting with a dialogue system. With this in mind, we based evaluations of the dialogue and prompting strategies implemented in our system on how they influence users' answers. In this paper we will describe the measure we used to evaluate the effect of the size of the welcome prompt and a set of measures we defined to evaluate three different confirmation strategies. We will then describe five criteria we used to evaluate the complexity of the system's questions and its effect on users' answers. The overall aim is to design a set of metrics that could be used to automatically decide which of the possible prompts at a given state in a dialogue should be uttered. | |
SLR Validation: Present State of Affairs and Prospects | This paper deals with the quality evaluation (validation) and improvement of Spoken Language Resources (SLR). We discuss a number of aspects of SLR validation. We review the work done so far in this field. The most important validation check points and our view on their rank order are listed. We propose a strategy for validation and improvement of SLR that is presently considered at the European Language Resources Association, ELRA. And finally, we show some of our future plans in these directions. | |
EULER: an Open, Generic, Multilingual and Multi-platform Text-to-Speech System | The aim of the collaborative project presented in this paper is to obtain a set of highly modular Text-To-Speech synthesizers for as many voices, languages and dialects as possible, free for use in non-commercial and non-military applications. This project is an extension of the MBROLA project: MBROLA is a speech synthesizer, freely distributed for non-commercial purposes, which uses diphone databases provided by users (19 languages in year 2000). EULER extends this idea to whole TTS systems by providing a backbone structure (MLC) and several generic algorithms for POS tagging, grapheme-to-phoneme conversion, and prosody generation. To demonstrate the potential of the architecture and draw developers' interest, we provide a full EULER-based TTS in French and in Arabic. EULER currently runs on Windows and Linux, and it is an open project: many of its components (and certainly its kernel) are provided as GNU C++ sources. It also incorporates, as much as possible, components and data derived from other TTS-related projects. | |
On the Use of Prosody for On-line Evaluation of Spoken Dialogue Systems | This paper focuses on the users’ signaling of information status in human-machine interactions, and in particular looks at the role prosody may play in this respect. Using a corpus of interactions with two Dutch spoken dialogue systems, prosodic correlates of users’ disconfirmations were investigated. In this corpus, disconfirmations may serve as a signal to ‘go on’ in one context and as a signal to ‘go back’ in another. With the data obtained from this corpus an acoustic and a perception experiment have been carried out. The acoustic analysis shows that the difference in signaling function is reflected in the distribution of the various types of disconfirmations as well as in different prosodic variables (pause, duration, intonation contour and pitch range). The perception experiment revealed that subjects are very good at classifying disconfirmations as positive or negative signals (without context), which strongly suggests that the acoustic features have communicative relevance. The implications of these results for human-machine interactions are discussed. | |
A Word-level Morphosyntactic Analyzer for Basque | This work presents the development and implementation of a full morphological analyzer for Basque, an agglutinative language. Several problems (phrase structure inside word-forms, noun ellipsis, multiplicity of values for the same feature and the use of complex linguistic representations) have forced us to go beyond the morphological segmentation of words, and to include an extra module that performs a full morphosyntactic parsing of each word-form. A unification-based word-level grammar has been defined for that purpose. The system has been integrated into a general environment for the automatic processing of corpora, using TEI-conformant SGML feature structures. | |
The EUDICO Project, Multi Media Annotation over the Internet | In this paper we describe a software environment that facilitates media annotation and analysis of media related corpora over the internet. We will describe the general architecture of this environment and we will introduce our Abstract Corpus Model with which we isolate corpora specific formats from the annotation and analysis tools. The main set of tools is described by giving examples of their usage. Finally we will discuss features regarding the distributed character of this environment. | |
Towards a Strategy for a Representation of Collocations - Extending the Danish PAROLE-lexicon | We describe our attempts to formulate a pragmatic definition and a partial typology of the lexical category of ’collocation’ taking both lexicographical and computational aspects into consideration. This provides a suitable basis for encoding collocations in an NLP-lexicon. Further, this paper explains the principles of an operational encoding strategy which is applied to a core section of the typology, namely to subtypes of verbal collocation. This strategy is adapted to a pre-defined lexicon model which has been developed in the PAROLE-project. The work is carried out within the framework of the STO-project the aim of which is to extend the Danish PAROLE-lexicon. The encoding of collocations, in addition to single-word lemmas, greatly increases the lexical and linguistic coverage and thereby also the usability of the lexicon as a whole. Decisions concerning the selection of the most frequent types of collocation to be encoded are made on empirical data i.e. corpus-based recognition. We present linguistic descriptions with focus on some characteristic syntactic features of collocations that are observed in a newspaper corpus. We then give a few prototypical examples provided with formalised descriptions in order to illustrate the restriction features. Finally, we discuss the perspectives of the work done so far. | |
Perceptual Evaluation of Text-to-Speech Implementation of Enclitic Stress in Greek | This paper presents a perceptual evaluation of a text to speech (TTS) synthesizer in Greek with respect to acoustic registration of enclitic stress and related naturalness and intelligibility. Based on acoustical measurements and observations of naturally recorded utterances, the corresponding output of a commercially available formant-based speech synthesizer was altered and the results were subjected to perceptual evaluation. Pitch curve, intensity, and duration of the syllable bearing enclitic stress, were acoustically manipulated, while a phonetically identical phrase contrasting only in stress served as control stimulus. Ten listeners judged the perceived naturalness and preference (in pairs) and the stress pattern of each variant of a base phrase. It was found that intensity modification adversely affected perceived naturalness while increasing perceived stress prominence. Duration modification had no appreciable effect. Pitch curve modification tended to produce an improvement in perceived naturalness and preference but the results failed to achieve statistical significance. The results indicated that the current prosodic module of the speech synthesizer reflects a good balance between prominence of stress assignment, intelligibility, and naturalness. | |
Creation of Spoken Hebrew Databases | Two Spoken Hebrew databases were collected over fixed telephone lines at NSC - Natural Speech Communication. Their creation was based on the SpeechDat model, and represents the first comprehensive spoken database in Modern Hebrew that can be successfully applied to the teleservices industry. The speakers are a representative sample of Israelis, based on sociolinguistic factors such as age, gender, years of education and country of origin. The database includes digit sequences, natural numbers, money amounts, time expressions, dates, spelled words, application words and phrases for teleservices (e.g., call, save, play), phonetically rich words, phonetically rich sentences, and names. Both read speech and spontaneous speech were elicited. | |
PLEDIT - A New Efficient Tool for Management of Multilingual Pronunciation Lexica and Batchlists | The program tool PLEDIT - Pronunciation Lexica Editor - has been created for efficient handling of pronunciation lexica and batchlists. PLEDIT is designed as a GUI, which incorporates tools for fast and efficient management of pronunciation lexica and batchlists. The tool is written in Tcl/Tk/Tix and can thus be easily ported to different platforms. PLEDIT supports three lexicon format types, which are the Siemens, SpeechDat and CMU lexicon formats. PLEDIT enables full editing capability for lexica and batchlists and supports work with multilingual resources. Some functions have been built in as external programs written in the C programming language. With these external programs, higher speed and efficiency of PLEDIT have been achieved. | |
Use of Greek and Latin Forms for Term Detection | It is well known that many languages make use of neo-classical compounds, and that some domains with a very long tradition, like medicine, make intense use of such morphemes. This phenomenon has been studied extensively for different languages, with the common result that a relatively small number of morphemes allows a high number of specialised terms to be detected. We believe that the use of such morphological knowledge may help a term detector in discovering very specialised terms. In this paper we propose a module, to be included in a term extractor, devoted specifically to detecting terms that include neo-classical compounds. We describe this module as well as the results obtained with it. | |
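A hedged sketch of the core idea, flagging candidate word forms that contain Greek or Latin combining forms; the morpheme list below is a toy illustration, not the module's actual inventory.

```python
# Illustrative sketch: flag word forms containing neo-classical morphemes.
NEOCLASSICAL = ["cardio", "gastro", "hepat", "itis", "ectomy", "algia"]

def contains_neoclassical(word):
    w = word.lower()
    return any(m in w for m in NEOCLASSICAL)

candidates = ["gastroenteritis", "appendectomy", "table", "cardiology"]
print([w for w in candidates if contains_neoclassical(w)])
# ['gastroenteritis', 'appendectomy', 'cardiology']
```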
Methods and Metrics for the Evaluation of Dictation Systems: a Case Study | This paper describes the practical evaluation of two commercial dictation systems in order to assess the potential usefulness of such technology in the specific context of a translation service translating legal text into Italian. The service suffers at times from heavy workload, lengthy documents and short deadlines. Use of dictation systems accepting continuous speech might improve productivity at these times. Design and execution of the evaluation followed the methodology worked out by the EAGLES Evaluation Working Group. The evaluation therefore also constitutes a test bed application of this methodology. | |
Cairo: An Alignment Visualization Tool | While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system. We developed Cairo to fill this need. Cairo is a free, open-source, portable, user-friendly, GUI-driven program written in Java that provides a visual representation of word correspondences between bilingual pairs of sentences, as well as relevant translation model parameters. This program can be easily adapted for visualization of correspondences in bi-texts based on probability distributions. | |
An XML-based Representation Format for Syntactically Annotated Corpora | This paper discusses a general approach to the description and encoding of linguistic corpora annotated with hierarchically structured syntactic information. A general format can be motivated by the variety and incompatibility of existing annotation formats. By using XML as a representation format the theoretical and technical problems encountered can be overcome. | |
An Experiment of Lexical-Semantic Tagging of an Italian Corpus | The availability of semantically tagged corpora is becoming a very important and urgent need for training and evaluation within a large number of applications; such corpora are also the natural application and accompaniment of semantic lexicons, for which they constitute both a useful testbed for evaluating adequacy and a repository of corpus examples for the attested senses. It is therefore essential that sound criteria are defined for their construction and a specific methodology is set up for the treatment of various semantic phenomena relevant to this level of description. In this paper we present some observations and results concerning an experiment of manual lexical-semantic tagging of a small Italian corpus performed within the framework of the ELSNET project. The ELSNET experimental project has to be considered as a feasibility study. It is part of a preparatory and training phase, started with the Romanseval/Senseval experiment (Calzolari et al., 1998), and ending up with the lexical-semantic annotation of larger quantities of semantically annotated texts such as the syntactic-semantic Treebank which is going to be annotated within an Italian National Project (SI-TAL). Indeed, the results of the ELSNET experiment have been of utmost importance for the definition of the technical guidelines for the lexical-semantic level of description of the Treebank. | |
SIMPLE: A General Framework for the Development of Multilingual Lexicons | The project LE-SIMPLE is an innovative attempt of building harmonized syntactic-semantic lexicons for 12 European languages, aimed at use in different Human Language Technology applications. SIMPLE provides a general design model for the encoding of a large amount of semantic information, spanning from ontological typing, to argument structure and terminology. SIMPLE thus provides a general framework for resource development, where state-of-the-art results in lexical semantics are coupled with the needs of Language Engineering applications accessing semantic information. | |
Electronic Language Resources for Polish: POLEX, CEGLEX and GRAMLEX | We present theoretical results and resources obtained within three projects: the national project POLEX, Copernicus 1 Project CEGLEX (1032) and Copernicus Project GRAMLEX (632). Morphological resources obtained within these projects contribute to filling the gap on the map of available electronic language resources for Polish. After a short presentation of some common methodological bases defined within the POLEX project, we proceed to present the methodology and data obtained in the CEGLEX and GRAMLEX projects. The intention of the Polish language part of CEGLEX was to test formats proposed by the GENELEX project against Polish data. The aim of the GRAMLEX project was to create corpus-based morphological resources for Polish. GRAMLEX refers directly to the morphological part of the CEGLEX project. Large samples of the data presented here are accessible at http://main.amu.edu.pl/~zlisi/projects.htm. | |
SPEECON - Speech Data for Consumer Devices | SPEECON, launched in February 2000, is a project focusing on collecting linguistic data for speech recogniser training. Put into action by an industrial consortium, it promotes the development of voice controlled consumer applications such as television sets, video recorders, audio equipment, toys, information kiosks, mobile phones, palmtop computers and car navigation kits. During the lifetime of the project, scheduled to last two years, partners will collect speech data for 18 languages or dialectal zones, including most of the languages spoken in the EU. Attention will also be devoted to research into the environment of the recordings, which are, like the typical surroundings of CE applications, at home, in the office, in public places or in moving vehicles. The following pages will give a brief overview of the workplan for the months to come. | |
A Treebank of Spanish and its Application to Parsing | This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction. | |
End-to-End Evaluation of Machine Interpretation Systems: A Graphical Evaluation Tool | VERBMOBIL, a long-term project of the Federal Ministry of Education, Science, Research and Technology, aims at developing a mobile translation system for spontaneous speech. The source-language input consists of human speech (English, German or Japanese); the translation (bidirectional English-German and Japanese-German) and target-language output are effected by the VERBMOBIL system. Owing to the innovative character of the project, new methods for end-to-end evaluation had to be developed by a subproject which has been established especially for this purpose. In this paper we present criteria for the evaluation of speech-to-speech translation systems and a tool for judging the translation quality which is called the Graphical Evaluation Tool (GET). | |
A Proposal for the Integration of NLP Tools using SGML-Tagged Documents | In this paper we present the strategy used for an integration, in a common framework, of the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FS) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated until now are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. Due to the complexity of the information to be exchanged among the different tools, FSs are used to represent it. Feature structures are coded following the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The use of SGML for encoding the I/O streams flowing between programs forces us to formally describe the mark-up, and provides software to check that this mark-up holds invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for the communication between the tools has been designed and implemented. It offers the necessary operations to get the information from an SGML document containing FSs, and to produce the corresponding output according to a well-defined FSD. | |
A Bilingual Electronic Dictionary for Frame Semantics | Frame semantics is a linguistic theory which is currently gaining ground. The creation of lexical entries for a large number of words presupposes the development of complex lexical acquisition techniques in order to identify the vocabulary for describing the elements of a 'frame'. In this paper, we show how a lexical-semantic database compiled on the basis of a bilingual (English-French) dictionary can be used to identify some general frame elements which are relevant in a frame-semantic approach such as the one adopted in the FrameNet project (Fillmore & Atkins 1998, Gahl 1998). The database has been systematically enriched with explicit lexical-semantic relations holding between some elements of the microstructure of the dictionary entries. The manifold relationships have been labelled in terms of lexical functions, based on Mel'cuk's notion of co-occurrence and lexical-semantic relations in Meaning-Text Theory (Mel'cuk et al. 1984). We show how these lexical functions can be used and refined to extract potential realizations of frame elements such as typical instruments or typical locatives, which are believed to be recurrent elements in a large number of frames. We also show how the database organization of the computational lexicon makes it possible to readily access implicit and translationally-relevant combinatorial information. | |
The Evaluation of Systems for Cross-language Information Retrieval | We describe the creation of an infrastructure for the testing of cross-language text retrieval systems within the context of the Text REtrieval Conferences (TREC) organised by the US National Institute of Standards and Technology (NIST). The approach adopted and the issues that had to be taken into consideration when building a multilingual test suite and developing appropriate evaluation procedures to test cross-language systems are described. From 2000 on, a cross-language evaluation activity for European languages known as CLEF (Cross-Language Evaluation Forum) will be coordinated in Europe, while TREC will focus on Asian languages. The implications of the move to Europe and the intentions for the future are discussed. | |
Spoken Portuguese: Geographic and Social Varieties | The Spoken Portuguese: Geographic and Social Varieties project has as its main goal the teaching of Portuguese as a foreign language. The idea is to provide a collection of authentic spoken texts and to make it easy to use. Therefore, a selection of spontaneous oral data was made, using either already compiled material or material recorded for this purpose. The resulting corpus is a representative sample that includes European, Brazilian and African Portuguese, as well as Macau and East-Timor Portuguese. In order to produce a functional product, the Linguistics Center of Lisbon University developed a sound/text alignment software tool. The final result is a CD-ROM collection that contains 83 text files, 83 sound files and 83 files produced by the sound/text alignment tool. This independence between sound and text files allows the CD-ROM user to use it for purposes other than the educational one. | |
Portuguese Corpora at CLUL | The Corpus de Referencia do Portugues Contemporaneo (CRPC) has been developed at the Centro de Linguistica da Universidade de Lisboa (CLUL) since 1988 from a perspective of research data enlargement, in the sense of verifying concepts and hypotheses rather than relying solely on intuitive data. The intention of creating this open corpus is to establish an on-line representative sample collection of general usage contemporary Portuguese: a main corpus of great dimension as well as several specialized corpora. The CRPC currently has around 92 million words. Following common practice in this area, the CRPC project intends to establish a linguistic database accessible to everyone interested in making theoretical and practical studies or applications. The Dialectal oral corpus of the Atlas Linguistico-Etnografico de Portugal e da Galiza (ALEPG) consists of approximately 3500 hours of speech collected by the CLUL Dialectal Studies Research Group and recorded on analogue audio tape. This corpus contains mainly directed speech: answers to a linguistic questionnaire that is essentially lexical but also focuses on some phonetic and morpho-phonological phenomena. An important part of spontaneous speech enables other kinds of studies, such as syntactic, morphological or phonetic ones. | |
Reusing the Mikrokosmos Ontology for Concept-based Multilingual Terminology Databases | This paper reports work carried out within a multilingual terminology project (OncoTerm) in which the Mikrokosmos (µK) ontology (Mahesh, 1996; Viegas et al., 1999) has been used as a language independent conceptual structure to achieve a truly concept-based terminology database (termbase, for short). The original ontology, containing nearly 4,700 concepts and available in Lisp-like format (January 1997 version), was first converted into a set of tables in a relational database. A specific software tool was developed in order to edit and browse this resource. This tool has now been integrated within a termbase editor and released under the name of OntoTerm™. In this paper we focus on the suitability of the µK ontology for the representation of domain-specific knowledge and its associated lexical items. | |
Abstraction of the EDR Concept Classification and its Effectiveness in Word Sense Disambiguation | The relation between the degree of abstraction of a concept and the explanation capability (validity and coverage) of the conceptual description which is the constraint held between concepts is clarified experimentally by performing an operation called concept abstraction. This is the procedure that chooses a certain set of lower level concepts in a concept hierarchy and maps the set to one or more upper level (abstract) concepts. We used three abstraction techniques, the flat depth, flat size, and flat probability methods, for the degree of abstraction. Taking these methods and degrees as parameters, we applied concept abstraction to the EDR Concept Classifications and performed a word sense disambiguation test. The test set and the disambiguation knowledge were extracted as co-occurrence expressions from the EDR Corpora. Through the test, we found that the flat probability method gives the best result. We also carried out an evaluation by comparing the abstracted hierarchy with that of human introspection and found that the flat size method gives the results most similar to humans. These results should contribute to clarifying the appropriate granularity of a concept hierarchy given an application purpose. | |
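A minimal sketch of one of the abstraction operations mentioned above, the flat depth method: every concept deeper than a chosen depth is mapped to its ancestor at that depth. The parent-map representation and toy hierarchy are our own simplification of the EDR classification.

```python
# Illustrative sketch: flat-depth abstraction over a concept hierarchy.
def abstract_at_depth(concept, parent, depth_limit):
    """parent maps each concept to its parent (the root maps to None)."""
    path = [concept]
    while parent[path[-1]] is not None:      # climb to the root
        path.append(parent[path[-1]])
    path.reverse()                           # root ... concept
    # Concepts at or above the limit stay; deeper ones map to the ancestor there.
    return path[min(depth_limit, len(path)) - 1]

parent = {"entity": None, "animal": "entity", "dog": "animal", "poodle": "dog"}
print(abstract_at_depth("poodle", parent, 2))  # 'animal'
```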
Will Very Large Corpora Play For Semantic Disambiguation The Role That Massive Computing Power Is Playing For Other AI-Hard Problems? | In this paper we formally analyze the relation between the amount of (possibly noisy) examples provided to a word-sense classification algorithm and the performance of the classifier. In the first part of the paper, we show that Computational Learning Theory provides a suitable theoretical framework to establish one such relation. In the second part of the paper, we will apply our theoretical results to the case of a semantic disambiguation algorithm based on syntactic similarity. | |
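As a point of reference (our illustration, not the paper's derivation), bounds of the following standard PAC-learning form relate the number of training examples m to the error ε and confidence δ of a consistent classifier drawn from a finite hypothesis class H; the noisy setting considered in the paper leads to bounds of the same shape with a 1/ε² dependence.

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```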
Guidelines for Japanese Speech Synthesizer Evaluation | Speech synthesis technology is one of the most important elements required for better human interfaces for communication and information systems. This paper describes the ''Guidelines for Speech Synthesis System Performance Evaluation Methods'' created by the Speech Input/Output Systems Expert Committee of the Japan Electronic Industry Development Association (JEIDA). JEIDA has been investigating speech synthesizer evaluation methods since 1993 and previously reported the provisional version of the guidelines. The guidelines comprise six chapters: General rules, Text analysis evaluation, Syllable articulation test, Word intelligibility test, Sentence intelligibility test, and Overall quality evaluation. | |
Constructing a Tagged E-J Parallel Corpus for Assisting Japanese Software Engineers in Writing English Abstracts | This paper presents how we constructed a tagged E-J parallel corpus of sample abstracts, which is the core language resource for our English abstract writing tool, the “Abstract Helper.” This writing tool is aimed at helping Japanese software engineers be more productive in writing by providing them with good models of English abstracts. We collected 539 English abstracts from technical journals/proceedings and prepared their Japanese translations. After analyzing the rhetorical structure of these sample abstracts, we tagged each sample abstract with both an abstract type and an organizational-scheme type. We also tagged each sample sentence with a sentence role and one or more verb complementation patterns. We also show that our tagged E-J parallel corpus of sample abstracts can be effectively used for providing users with both discourse-level guidance and sentence-level assistance. Finally, we discuss the outlook for further development of the “Abstract Helper.” | |
Extraction of Unknown Words Using the Probability of Accepting the Kanji Character Sequence as One Word | In this paper, we propose a method to extract unknown words, which are composed of two or three kanji characters, from Japanese text. Generally, an unknown word composed of kanji characters is segmented into other words by the morphological analysis. Moreover, the appearance probability of each segmented word is small. From these features, we can define a measure for accepting a two- or three-kanji character sequence as an unknown word. On the other hand, we can find some segmentation patterns of unknown words. By applying our measure to kanji character sequences which have these patterns, we can extract unknown words. In the experiment, the F-measure for extraction of unknown words composed of two and three kanji characters was about 0.7 and 0.4 respectively. Our method does not need to use the frequency of the word in the training corpus to judge whether a word is an unknown word or not. Therefore, our method has the advantage that low-frequency unknown words can be extracted. | |
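A hedged sketch of the kind of acceptance measure the abstract describes: a two- or three-kanji sequence is accepted as one unknown word when the words produced by its forced morphological segmentation are individually improbable. The scoring formula and threshold are our own illustration, not the paper's exact measure.

```python
import math

def acceptance_score(segmented_words, unigram_prob, floor=1e-7):
    """Low joint probability of the forced segmentation -> high score."""
    return -sum(math.log(unigram_prob.get(w, floor)) for w in segmented_words)

def is_unknown_word(segmented_words, unigram_prob, threshold=25.0):
    return acceptance_score(segmented_words, unigram_prob) > threshold
```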
Automatic Speech Segmentation in High Noise Condition | The accurate segmentation of speech and end-point detection in adverse conditions is very important for building robust automatic speech recognition (ASR) systems. Segmentation of speech is not a trivial process - in high noise conditions it is very difficult to detect weak fricatives and nasals at the ends of words. An efficient speech segmentation algorithm that is independent of any a priori defined threshold and robust to the level of disturbance signals is developed. The results show a significant improvement in the robustness of the proposed algorithm with respect to traditional algorithms. | |
Open Ended Computerized Overview of Controlled Languages | Controlled languages (CLs) are of undoubted interest to industry (for safety and economic reasons, etc.), and those wishing to create a CL need to be aware of what has already been done. We have therefore built up an open-ended computerized overview which can give instant access to this information. To achieve it, we took a close look at what has been written in the field of CLs and tried to get in touch with the persons involved in different projects (K. Barthe, E. Johnson, K. Godden, B. Arendse, E. Adolphson, T. Hartley, etc.). | |
Shallow Parsing and Functional Structure in Italian Corpora | In this paper we argue in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000 word corpus of Italian. Most papers present approaches to tagging which are statistically based. None of the statistically based analyses, however, produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Of course their data refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely statistically based approaches are inefficient basically due to great sparsity of tag distribution - 50% or less of unambiguous tags when punctuation is subtracted from the total count. In addition, the level of homography is also very high: readings per word are 1.7 compared to 1.07 computed for English by [2] with a similar tagset. The current work includes a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora. In a preliminary experiment made with an automatic tagger, we obtained 99.97% accuracy on the training set and 99.03% on the test set using combined approaches: data derived from statistical tagging are well below 95% even when referred to the training set, and the same applies to syntactic tagging. As to the shallow parser and GF-assigner, we shall report on a first preliminary experiment on a manually verified subset made of 10,000 words. | |
Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon | During recent years the development of high-quality lexical resources for real-world Natural Language Processing (NLP) applications has gained a lot of attention from many research groups around the world, and from the European Union through its promotion of language engineering projects dealing directly or indirectly with this topic. In this paper, we focus on ways to extend and enrich such a resource, namely the Swedish version of the SIMPLE lexicon, in an automatic manner. The SIMPLE project (Semantic Information for Multifunctional Plurilingual Lexica) aims at developing wide-coverage semantic lexicons for 12 European languages, though on a rather small scale for practical NLP, namely fewer than 10,000 entries. Consequently, our intention is to explore and exploit various (inexpensive) methods to progressively enrich the resources and, subsequently, to annotate texts with the semantic information encoded within the framework of SIMPLE, enhanced with semantic data from the Gothenburg Lexical DataBase (GLDB) and from large corpora. | |
Providing Internet Access to Portuguese Corpora: the AC/DC Project | In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do portugues) as regards providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilizacao de Corpora, roughly ''Access and Availability of Corpora''), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still subject to regular improvement, we focus on the process of tagging and parsing the underlying corpora, using a Constraint Grammar parser for Portuguese. | |
Turkish Electronic Living Lexicon (TELL): A Lexical Database | The purpose of the TELL project is to create a database of Turkish lexical items which reflects actual speaker knowledge, rather than the normative and phonologically incomplete dictionary representations on which most of the existing phonological literature on Turkish is based. The database, accessible over the internet, should greatly enhance phonological, morphological, and lexical research on the language. The current version of TELL consists of the following components: • Some 15,000 headwords from the 2nd and 3rd editions of the Oxford Turkish-English dictionary, orthographically represented. • Proper names, including 175 place names from a guide to Istanbul, and 5,000 place names from a telephone area code directory of Turkey. • Phonemic transcriptions of the pronunciations of the same headwords and place names embedded in various morphological contexts. (Eliciting suffixed forms along with stems exposes any morphophonemic alternations that the headwords in question are subject to.) • Etymological information, garnered from a variety of etymological sources. • Roots for a number of morphologically complex headwords. The paper describes the construction of the current structure of the TELL database, points out potential questions that could be addressed by putting the database to use, and specifies goals for the next phase of the project. | |
Orthographic Transcription of the Spoken Dutch Corpus | This paper focuses on the specification of the orthographic transcription task in the Spoken Dutch Corpus, the problems encountered in making that specification and the evaluation experiments that were carried out to assess the transcription efficiency and the inter-transcriber consistency. It is stated that the role of the orthographic transcriptions in the Spoken Dutch Corpus is twofold: on the one hand, the transcriptions are important for future database users, on the other hand they are indispensable to the development of the corpus itself. The main objectives of the transcription task are the following: (1) to obtain a verbatim transcription that can be made with a minimum level of interpretation of the utterances; (2) to obtain an alignment of the transcription to the speech signal on the level of relatively short chunks; (3) to obtain a transcription that is useful to researchers working in several research areas and (4) to adhere to international standards for existing large speech corpora. In designing the transcription protocol and transcription procedure it was attempted to establish the best compromise between consistency, accuracy and usability of the output and efficiency of the transcription task. For example, the transcription procedure always consists of a first transcription cycle and a verification cycle. Some efficiency and consistency statistics derived from pilot experiments with several students transcribing the same material are presented at the end of the paper. In these experiments the transcribers were also asked to record the amount of time they spent on the different audio files, and to report difficulties they encountered in performing their task. | |
Development of Acoustic and Linguistic Resources for Research and Evaluation in Interactive Vocal Information Servers | This paper describes the setting up of a resource database for research and evaluation in the domain of interactive vocal information servers. All this resource development work took place in a research project aiming at the development of an advanced speech recognition system for the automatic processing of telephone directory requests, and was performed on the basis of the Swiss-French Polyphone database (collected in the framework of the European SpeechDat project). Due to the unavailability, for the targeted area, of a properly orthographically transcribed, consistently labeled and tagged database of unconstrained speech (together with its associated lexicon), we first concentrated on the annotation and structuring of the spoken request data in order to make them usable for lexical and linguistic modeling and for the evaluation of recognition results. A baseline speech recognition system was then trained on the newly developed resources and tested. Preliminary recognition experiments showed a relative improvement of 46% in Word Error Rate (WER) compared to the results previously obtained with a very similar baseline system working on the inconsistent natural speech database that was originally available. | |
An Architecture for Document Routing in Spanish: Two Language Components, Pre-processor and Parser | This paper describes the language components of a system for Document Routing in Spanish. The system identifies terms that are relevant for classification within the documents involved by means of natural language processing techniques. These techniques are based on the isolation and normalization of syntactic units considered relevant for the classification, especially noun phrases, but also other constituents built around verbs, adverbs, pronouns or adjectives. After a general introduction about the research project, the second section relates our approach to previous and current approaches to the problem, and the third describes the corpora used for evaluating the system. The linguistic analysis architecture, including pre-processing and two different levels of syntactic analysis, is described in the fourth and fifth sections, while the last one is dedicated to a comparative analysis of the results obtained from processing the corpora introduced in the third section. Certain future developments of the system are also included in this section. | |
Target Suites for Evaluating the Coverage of Text Generators | Our goal is to evaluate the grammatical coverage of the surface realization component of a natural language generation system by means of target suites. We consider the utility of re-using for this purpose test suites designed to assess the coverage of natural language analysis / understanding systems. We find that they are of some interest, in helping inter-system comparisons and in providing an essential link to annotated corpora. But they have limitations. First, they contain a high proportion of ill-formed items which are inappropriate as targets for generation. Second, they omit phenomena such as discourse markers which are key issues in text production. We illustrate a partial remedy for this situation in the form of a text generator that annotates its own output to an externally specified standard, the TSNLP scheme. | |
LT TTT - A Flexible Tokenisation Tool | We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up and the preparation | |
Perception and Analysis of a Reiterant Speech Paradigm: a Functional Diagnostic of Synthetic Prosody | A set of perception experiments, using reiterant speech, was designed to carry out a diagnostic of the segmentation/hierarchisation linguistic function of prosody. The prosodic parameters of F0, syllabic duration and intensity of the stimuli used during this experiment were extracted. Several dissimilarity measures (correlation, root-mean-square distance and mutual information) were used to match the results of the subjective experiment. This comparison of the listeners' perception with acoustic parameters is intended to underline the acoustic cues used by listeners to judge the adequacy of prosody in performing a given linguistic function. | |
Development and Evaluation of an Italian Broadcast News Corpus | This paper reports on the development and evaluation of an Italian broadcast news corpus at ITC-irst, under a contract with the European Language resources Distribution Agency (ELDA). The corpus consists of 30 hours of recordings transcribed and annotated with conventions similar to those adopted by the Linguistic Data Consortium for the DARPA HUB-4 corpora. The corpus will be completed and released to ELDA by April 2000. | |
Multilingual Linguistic Resources: From Monolingual Lexicons to Bilingual Interrelated Lexicons | This paper describes a procedure to convert the PAROLE-SIMPLE monolingual lexicons into bilingual interrelated lexicons where each word sense of a given language is linked to the pertinent sense of the right words in one or more target lexicons. At present, the SIMPLE lexicons are monolingual, although the ultimate goal of these harmonised monolingual lexicons is to build multilingual lexical resources. For achieving this goal it is necessary to automatise the linking among the different senses of the different monolingual lexicons, since producing such multilingual relations by hand would be, like all tasks related to the development of linguistic resources, unaffordable in terms of human resources and time. The system we describe in this paper takes advantage of the SIMPLE model and the SIMPLE-based lexicons so that, in the best case, it can find fully automatically the relevant sense-to-sense correspondences for determining the translational equivalence of two words in two different languages and, in the worst case, it is able to narrow the set of admissible links between words and relevant senses. This paper also explores to what extent the semantic encoding in existing computational lexicons such as SIMPLE can help in overcoming the problems that arise when using monolingual meaning descriptions for bilingual links, and aims to set the basis for defining a model for adding a bilingual layer to the SIMPLE model. This bilingual layer, based on a bilingual relation model, will indeed be the basis for defining the multilingual language resource we want the PAROLE-SIMPLE lexicons to become. | |
Where Opposites Meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation | The paper describes the use of FAME, a functional annotation meta-scheme for comparison and evaluation of syntactic annotation schemes, i) as a flexible yardstick in multi-lingual and multi-modal parser evaluation campaigns and ii) for corpus annotation. We show that FAME complies with a variety of non-trivial methodological requirements, and has the potential for being effectively used as an “interlingua” between different syntactic representation formats. | |
Controlled Bootstrapping of Lexico-semantic Classes as a Bridge between Paradigmatic and Syntagmatic Knowledge: Methodology and Evaluation | Semantic classification of words is a highly context-sensitive and somewhat moving target, hard to deal with and even harder to evaluate on an objective basis. In this paper we suggest a step-wise methodology for automatic acquisition of lexico-semantic classes and delve into the non-trivial issue of how results should be evaluated against a top-down reference standard. | |
Coreference Annotation: Whither? | The terms coreference and anaphora tend to be used inconsistently and interchangeably in much empirically-oriented work in NLP, and this threatens to lead to incoherent analyses of texts and arbitrary loss of information. This paper discusses the role of coreference annotation in Information Extraction, focussing on the coreference scheme defined for the MUC-7 evaluation exercise. We point out deficiencies in that scheme and make some suggestions towards a new annotation philosophy. | |
Evaluation of a Dialogue System Based on a Generic Model that Combines Robust Speech Understanding and Mixed-initiative Control | This paper presents a generic model to combine robust speech understanding and mixed-initiative dialogue control in spoken dialogue systems. It relies on the use of semantic frames to conceptually store user interactions, a frame-unification procedure to deal with partial information, and a stack structure to handle initiative control. This model has been successfully applied in a dialogue system being developed at our lab, named SAPLEN, which aims to deal with the telephone-based product orders and queries of fast food restaurants’ clients. In this paper we present the dialogue system and describe the new model, together with the results of a preliminary evaluation of the system concerning recognition time, word accuracy, implicit recovery and speech understanding. Finally, we present the conclusions and indicate possibilities for future work. | |
MDWOZ: A Wizard of Oz Environment for Dialog Systems Development | This paper describes MDWOZ, a development environment for spoken dialog systems based on the Wizard of Oz technique, whose main goal is to facilitate data collection (speech signal and dialog related information) and interaction model building. Both these tasks can be quite difficult, and such an environment can facilitate them very much. Due to the modular way in which MDWOZ was implemented, it is possible to reuse parts of it in the final dialog system. The environment provides language-transparent facilities and accessible methods such that even non-computing specialists can participate in spoken dialog systems development. The main features of the environment are presented, together with some test experiments. | |
A Web-based Text Corpora Development System | One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system which focuses both on the size and the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless source of texts. To ensure a certain quality, we enrich the text with relevant information, to be fit for further use, by treating in an integrated manner the problems of morpho-syntactic annotation, lexical ambiguity resolution, and diacritic characters restoration. Although at this moment it is targeted at texts in Romanian, the system can be adapted to other languages, provided that some appropriate auxiliary resources are available. | |
Term-based Identification of Sentences for Text Summarisation | The present paper describes a methodology for automatic text summarisation of Greek texts which combines terminology extraction and sentence spotting. Since generating abstracts has proven a hard NLP task of questionable effectiveness, the paper focuses on the production of a special kind of abstract, called extracts: sets of sentences taken from the original text. These sentences are selected on the basis of the amount of information they carry about the subject content. The proposed corpus-based, statistical approach exploits several heuristics to determine the summary-worthiness of sentences. It uses statistical occurrences of terms (the TF·IDF formula) and several cue phrases to calculate sentence weights, and then extracts the top-scoring sentences, which form the extract. | |
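The sentence-weighting step can be sketched as below; this is only a schematic rendering with invented variable names, and it omits the cue-phrase component the paper also uses.

import math
from collections import Counter

def tfidf_extract(sentences, corpus_df, n_docs, k=3):
    # corpus_df: term -> number of corpus documents containing the term.
    # A sentence's weight is the sum of TF*IDF over its terms (a simplification
    # of the paper's scoring, which also weights cue phrases).
    scored = []
    for s in sentences:
        tf = Counter(s.lower().split())
        weight = sum(tf[t] * math.log(n_docs / (1 + corpus_df.get(t, 0))) for t in tf)
        scored.append((weight, s))
    return [s for _, s in sorted(scored, reverse=True)[:k]]

df = {"the": 90, "committee": 40, "met": 60, "on": 95, "tuesday": 50, ".": 99,
      "inflation": 3, "rose": 20, "sharply": 8, "last": 70, "year": 60}
sents = ["Inflation rose sharply last year .", "The committee met on Tuesday ."]
print(tfidf_extract(sents, df, n_docs=100, k=1))  # ['Inflation rose sharply last year .']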
Morphemic Analysis and Morphological Tagging of Latvian Corpus | There are approximately 8 million running words in the Latvian Corpus, an initial size for investigations using a national corpus. The corpus contains different texts: modern written Latvian, various newspapers, Latvian classical literature, the Bible, Latvian Folk Beliefs, Latvian Folk Songs, Latvian fairy-tales and others. The methodology and the software for SGML tagging were developed by the Artificial Intelligence Laboratory; approximately 3 million running words are marked up in SGML. The first step was to develop morphemic analysis in co-operation with Dr. B. Kangere from Stockholm University. The first morphological analyzer was developed in 1994 at the Artificial Intelligence Laboratory. The analyzer has its own tag system; later, the tags for the morphological analyzer were elaborated according to the MULTEXT-East recommendations. The Latvian morphological system is rather complicated, and there are many difficulties with the recognition of words and word forms, since Latvian has many homonymous forms. The first corpus of texts for morphological analysis was marked up manually; in total it covers approximately 10,000 words of modern written Latvian. The results of this work will be used in further investigations. | |
Textual Information Retrieval Systems Test: The Point of View of an Organizer and Corpuses Provider | Amaryllis is an evaluation programme for text retrieval systems which has been carried out as two test campaigns. The second Amaryllis campaign took place in 1998/1999. Corpuses of documents, topics, and the corresponding responses were first sent to each of the participating teams for system learning purposes. Corpuses of new documents and a set of new topics were then supplied for evaluation purposes. Two optional tracks were added: an Internet track and an interlingual track. The first of these consisted of a test via the Internet: INIST sent topics to the system and collected responses directly, thus reducing the need for manipulation by the system designers. The second contained tests on different European Community language pairs. The corpuses of documents consisted of records of questions and answers from the European Commission, in parallel official language versions. Participants could use any language pair for their tests. The aim of this paper is to give the point of view of an organizer and corpus provider (INIST) on the organization of an operation of this sort. In particular, it describes the difficulties encountered during the tests (corpus construction, translation of topics and system evaluation), and suggests avenues to explore for future tests. | |
The Spoken Dutch Corpus. Overview and First Evaluation | In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, its aims, structure and organization. It then goes on to discuss the considerations - both methodological and practical - that have played a role in the design of the corpus as well as in its compilation and annotation. The paper concludes with an account of the data that are available in the first release of the first part of the corpus that came out on March 1st, 2000. | |
A Strategy for the Syntactic Parsing of Corpora: from Constraint Grammar Output to Unification-based Processing | This paper presents a strategy for syntactic analysis based on the combination of two different parsing techniques: lexical syntactic tagging and phrase structure syntactic parsing. The basic proposal is to take advantage of the good results on lexical syntactic tagging to improve the whole performance of unification-based parsing. The syntactic functions attached to every word by the lexical syntactic tagging are used as head features in the unification-based grammar, and are the base for grammar rules. | |
Producing LRs in Parallel with Lexicographic Description: the DCC project | This paper is a brief presentation of some aspects of the most important lexicographical project currently being carried out in Catalonia: the DCC (Dictionary of Contemporary Catalan) project. After a general description of the aims of the project, the specific goal of my contribution is to present the general strategy of our lexicographical description, which consists of producing an electronic dictionary able to serve as the common repository from which we will obtain different derived products (among them the human dictionary). My concern is to show to what extent human and computer lexicography can share descriptions, and how the results of lexicographic work can be taken as a language resource in this new perspective. I will present different aspects and criteria of our dictionary, taking the different layers (morphology, syntax, semantics) as a guideline. | |
A Novelty-based Evaluation Method for Information Retrieval | In information retrieval research, precision and recall have long been used to evaluate IR systems. However, given that a number of retrieval systems resembling one another are already available to the public, it is valuable to retrieve novel relevant documents, i.e., documents that cannot be retrieved by those existing systems. In view of this problem, we propose an evaluation method that favors systems retrieving as many novel documents as possible. We also used our method to evaluate systems that participated in the IREX workshop. | |
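A minimal sketch of the underlying idea follows; the function and its arguments are assumptions for illustration, not the formula actually used in the IREX evaluation.

def novelty_precision(retrieved, relevant, baseline_pools):
    # retrieved: ranked list of document ids returned by the evaluated system
    # relevant: set of document ids judged relevant for the topic
    # baseline_pools: list of sets of ids retrieved by the existing systems.
    # A relevant document counts as novel only if no existing system already
    # retrieved it (a hypothetical rendering of the idea in the abstract).
    seen_elsewhere = set().union(*baseline_pools) if baseline_pools else set()
    novel_hits = [d for d in retrieved if d in relevant and d not in seen_elsewhere]
    return len(novel_hits) / len(retrieved) if retrieved else 0.0

print(novelty_precision(["d1", "d2", "d3"], {"d1", "d3"}, [{"d1"}]))  # 0.333...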
Towards More Comprehensive Evaluation in Anaphora Resolution | The paper presents a package of evaluation tasks for anaphora resolution. We argue that these newly added tasks which have been carried out on Mitkov's (1998) knowledge-poor, robust approach, provide a better picture of the performance of an anaphora resolution system. The paper also outlines future work on the development of a 'consistent' evaluation environment for anaphora resolution. | |
Galaxy-II as an Architecture for Spoken Dialogue Evaluation | The GALAXY-II architecture, comprising a centralized hub that mediates the interaction among a suite of human language technology servers, provides both a useful tool for implementing systems and a streamlined way of configuring their evaluation. In this paper, we discuss our ongoing efforts in the evaluation of spoken dialogue systems, with particular attention to the way in which the architecture facilitates the development of a variety of evaluation configurations. We furthermore propose two new metrics for automatic evaluation of the discourse and dialogue components of a spoken dialogue system, which we call “user frustration” and “information bit rate.” | |
Building the Croatian-English Parallel Corpus | The contribution gives a survey of the procedures and formats used in building the Croatian-English parallel corpus being collected at the Institute of Linguistics of the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After a quick survey of existing English-Croatian parallel corpora, the article deals with the procedures involved in text conversion and text encoding, particularly alignment. Several recent suggestions for alignment encoding are elaborated. Preliminary statistics on the numbers of S and W elements in each language are given at the end of the article. | |
Lexical and Translation Equivalence in Parallel Corpora | In the present paper we intend to investigate to what extent use of parallel corpora can help to eliminate some of the difficulties noted with bilingual dictionaries. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give and the preposition with. George Orwell's novel 1984 was used as source material, which is available in English-Hungarian sentence aligned form. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology. | |
Towards a Standard for Meta-descriptions of Language Resources | The aim is to improve the availability of Language Resources (LR) on the Intra- and Internet. It is suggested that this can be achieved by creating a browsable & searchable universe of meta-descriptions. This calls for the development of a standard for tagging LRs with meta-data, and for several conventions agreed within the community. | |
Object-oriented Access to the Estonian Phonetic Database | The paper introduces the Estonian Phonetic Database developed at the Laboratory of Phonetics and Speech Technology of the Institute of Cybernetics at the Tallinn Technical University, and its integration into QuickSig – an object-oriented speech processing environment developed at the Acoustics Laboratory of the Helsinki University of Technology. Methods of database access are discussed, relations between different speech units – sentences, words, phonemes – are defined, examples of predicate functions are given to perform searches for different contexts, and the advantage of an object-oriented paradigm is demonstrated. The introduced approach has been proven to be a flexible research environment allowing studies to be performed in a more efficient way. | |
ItalWordNet: a Large Semantic Database for Italian | The focus of this paper is on the work we are carrying out to develop a large semantic database within an Italian national project, SI-TAL, aiming at realizing a set of integrated (compatible) resources and tools for the automatic processing of the Italian language. Within SI-TAL, ItalWordNet is the reference lexical resource, which will contain information related to about 130,000 word senses grouped into synsets. This lexical database is not being created ex novo, but by extending and revising the Italian lexical wordnet built in the framework of the EuroWordNet project. In this paper we first describe how the lexical coverage of our wordnet is being extended by adding adjectives, adverbs and proper nouns, plus a terminological subset belonging to the economic and financial domain. The relevant changes that these extensions entail, both in the linguistic model and in the data structure, are then illustrated. In particular we discuss i) the new semantic relations identified to encode information on adjectives and adverbs and ii) the new architecture including the terminological subset. | |
FAST - Towards a Semi-automatic Annotation of Corpora | As the use of annotated corpora in natural language processing applications increases, we are aware of the necessity of having flexible annotation tools that would not only support the manual annotation, but also enable us to perform post-editing on a text which has already been automatically annotated using a separate processing tool and even to interact with the tool during the annotation process. In practice, we have been confronted with the problem of converting the output of different tools to SGML format, while preserving the previous annotation, as well as with the difficulty of post-editing manually an annotated text. It has occurred to us that designing an interface between an annotation tool and any automatic tool would not only provide an easy way of taking advantage of the automatic annotation but it would also allow an easier interactive manual editing of the results. FAST was designed as a manual tagger that can also be used in conjunction with automatic tools for speeding up the human annotation. | |
Coreference Resolution Evaluation Based on Descriptive Specificity | This paper introduces a new evaluation method for the coreference resolution task. Considering that coreference resolution is a matter of linking expressions to discourse referents, we set our evaluation criterion in terms of an evaluation of the denotations assigned to the expressions. This criterion requires that the coreference chains identified in one annotation stand in a one-to-one correspondence with the coreference chains in the other. To determine this correspondence, and with a view to keeping closer to what a human interpretation of the coreference chains would be, we take into account the fact that, in a coreference chain, some expressions are more specific to their referent than others. With this observation in mind, we measure the similarity between the chains in one annotation and the chains in the other, and then compute the optimal similarity between the two annotations. Evaluation then consists in checking whether the denotations assigned to the expressions are correct or not. New measures to analyse errors are also introduced. A comparison with other methods is given at the end of the paper. | |
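As a rough sketch of the chain-matching step (plain set overlap stands in for the paper's specificity-weighted similarity, and brute-force search is only feasible for a handful of chains):

from itertools import permutations

def chain_similarity(a, b):
    # Jaccard overlap between two coreference chains (sets of mention strings).
    # The paper weights mentions by descriptive specificity; plain overlap is
    # used here only as a stand-in.
    return len(a & b) / len(a | b) if a | b else 0.0

def optimal_alignment_score(key_chains, response_chains):
    # Best one-to-one pairing of key and response chains (brute force).
    best = 0.0
    for perm in permutations(response_chains):
        best = max(best, sum(chain_similarity(k, r) for k, r in zip(key_chains, perm)))
    return best

key = [{"Clinton", "the president", "he"}, {"the bill"}]
resp = [{"the president", "he"}, {"the bill", "it"}]
print(optimal_alignment_score(key, resp))  # about 1.17 (2/3 + 1/2)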
A Text->Meaning->Text Dictionary and Process | In this article we deal with various applications of a multilingual semantic network named the Integral Dictionary. We review different commercial applications that use semantic networks and we show the results obtained with the Integral Dictionary. The details of the semantic calculations are not given here, but we show that, contrary to the WordNet semantic net, the Integral Dictionary provides most of the data and relations needed for these calculations. The article presents results and discussion on lexical expansion, lexical reduction, WSD, query expansion, lexical translation extraction, document summarisation, email sorting, catalogue access and information retrieval. We conclude that a resource like the Integral Dictionary can be a good new step for all those who have tried to compute semantics with WordNet, and that the complementarity between the two dictionaries could be seriously studied in a shared project. | |
A French Phonetic Lexicon with Variants for Speech and Language Processing | This paper reports on a project aiming at the semi-automatic development of a large orthographic-phonetic lexicon for French, based on the Multext dictionary. It details the various stages of the project, with an emphasis on the methodological and design aspects. Information regarding the lexicon’s content is also given, together with a description of interface tools which should facilitate its exploitation. | |
Annotating Communication Problems Using the MATE Workbench | The increasing commercialisation and sophistication of language engineering products reinforces the need for tools and standards in support of a more cost-effective development and evaluation process than has been possible so far. This paper presents results of the MATE project, which was launched in response to the need for standards and tools in support of creating, annotating, evaluating and exploiting spoken language resources. Focusing on the MATE workbench, we illustrate its functionality and usability through its use for markup of communication problems. | |
A Methodology for Evaluating Spoken Language Dialogue Systems and Their Components | As spoken language dialogue systems (SLDSs) proliferate in the market place, the issue of SLDS evaluation has come to attract wide interest from research and industry alike. Yet it is only recently that spoken dialogue engineering researchers have come to face SLDS evaluation in its full complexity. This paper presents results of the European DISC project concerning technical evaluation and usability evaluation of SLDSs and their components. The paper presents a methodology for complete and correct evaluation of SLDSs and their components, together with a generic evaluation template for describing the evaluation criteria needed. | |
Evaluating Translation Quality as Input to Product Development | In this paper we present a corpus-based method to evaluate the translation quality of machine translation (MT) systems. We start with a shallow analysis of a large corpus and gradually focus the attention on the translation problems. The method constitutes an efficient way to identify the most important grammatical and lexical weaknesses of an MT system and to guide development towards improved translation quality. The evaluation described in the paper was carried out as a cooperation between an MT technology developer, Sail Labs, and the Computational Linguistics group at the University of Zurich. | |
Evaluation of Word Alignment Systems | Recent years have seen a few serious attempts to develop methods and measures for the evaluation of word alignment systems, notably the Blinker project (Melamed, 1998) and the ARCADE project (Veronis and Langlais, forthcoming). In this paper we discuss different approaches to the problem and report on results from a project where two word alignment systems have been evaluated. These results include methods and tools for the generation of reference data and a set of measures for system performance. We note that the selection and sampling of reference data can have a great impact on scoring results. | |
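For concreteness, the basic link-level measures such an evaluation compares systems on can be written as below; this is a simplification with invented data, and the project's actual reference data and measures may also distinguish, for example, partial or null links.

def alignment_scores(proposed_links, reference_links):
    # Precision, recall and F-score over sets of (source_idx, target_idx) links
    # for one aligned sentence pair.
    correct = proposed_links & reference_links
    p = len(correct) / len(proposed_links) if proposed_links else 0.0
    r = len(correct) / len(reference_links) if reference_links else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

links_sys = {(0, 0), (1, 2), (2, 1)}
links_ref = {(0, 0), (1, 1), (2, 1)}
print(alignment_scores(links_sys, links_ref))  # (0.666..., 0.666..., 0.666...)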
How To Evaluate and Compare Tagsets? A Proposal | We propose a methodology which allows an evaluation of the distributional qualities of a tagset and a comparison between tagsets. Evaluation of tagsets is crucial, since tagging is often considered one of the first tasks in language processing. The aim of tagging is to summarise linguistic information as well as possible for further processing such as syntactic parsing. The idea is to consider these further steps in order to evaluate a given tagset, and thus to measure the pertinence of the information provided by the tagset for these steps. For this purpose, a machine learning system, ALLiS, is used, whose goal is to learn phrase structures from bracketed corpora and to generate a formal grammar which describes these structures. ALLiS learning is based on the detection of structural regularities. By this means, some non-distributional behaviours of the tagset can be pointed out, and thus some of its weaknesses or inadequacies. | |
Determining the Tolerance of Text-handling Tasks for MT Output | With the explosion of the internet and access to increased amounts of information provided by international media, the need to process this abundance of information in an efficient and effective manner has become critical. The importance of machine translation (MT) in the stream of information processing has become apparent. With this new demand on the user community comes the need to assess an MT system before adding such a system to the user’s current suite of text-handling applications. The MT Functional Proficiency Scale project has developed a method for ranking the tolerance of a variety of information processing tasks to possibly poor MT output. This ranking allows for the prediction of an MT system’s usefulness for particular text-handling tasks. | |
A Parallel Corpus of Italian/German Legal Texts | This paper presents the creation of a parallel corpus of Italian and German legal documents which are translations of one another. The corpus, which contains approximately 5 million words, is primarily intended as a resource for (semi-)automatic terminology acquisition. The guidelines of the Corpus Encoding Standard have been applied for encoding structural information, segmentation information, and sentence alignment. Since the parallel texts have a one-to-one correspondence on the sentence level, building a perfect sentence alignment is rather straightforward. As a result, the corpus also constitutes a valuable testbed for the evaluation of alignment algorithms. The paper discusses the intended use of the corpus, the various phases of corpus compilation, and basic statistics. | |
Integrating Seed Names and ngrams for a Named Entity List and Classifier | We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled either as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in a 83-million word corpus, and their immediate contexts are stored as instances of their label. The latter 8-grams are used by a memory-based machine learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61 % and a recall of 56 %. On free text, named-entity token labeling accuracy is 71 %. | |
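The classification step can be pictured with the toy nearest-neighbour sketch below; the real system uses a memory-based learner over fixed 8-gram context windows, so the distance function, parameters and data here are stand-ins.

from collections import Counter

def classify_context(context, memory, k=3):
    # context: list of tokens around the name to be labeled
    # memory: list of (stored_context_tokens, label) pairs built from the
    # seed-name lookups; distance is simple token overlap (a stand-in for
    # the memory-based learner's metric).
    def overlap(a, b):
        return len(set(a) & set(b))
    neighbours = sorted(memory, key=lambda m: overlap(context, m[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0] if votes else None

memory = [(["minister", "van", "financien"], "PERSON"),
          (["gemeente", "in", "de", "provincie"], "LOCATION")]
print(classify_context(["de", "gemeente", "in", "limburg"], memory, k=1))  # LOCATION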
Automatically Expansion of Thesaurus Entries with a Different Thesaurus | We propose a method for expanding the entries in a thesaurus using a different thesaurus constructed with another concept. This method constructs a mapping table between the concept codes of these two different thesauri. Then, almost all of the entries of the latter thesaurus are assigned the concept codes of the former thesaurus via the mapping table between them. To confirm whether this method is effective or not, we construct a mapping table between the ''Kadokawa-shin-ruigo'' thesaurus (hereafter, ''ShinRuigo'') and ''Nihongo-goitaikei'' (hereafter, ''Goitaikei''), and assign about 350 thousand entries with the mapping table. About 10% of the entries cannot be assigned automatically. It is shown that this method can save cost in expanding a thesaurus. | |
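Schematically (the data layout and codes below are invented, not the actual ShinRuigo/Goitaikei formats), the mapping-table step amounts to:

def expand_entries(source_entries, mapping):
    # source_entries: dict entry -> concept code in the source thesaurus
    # mapping: dict source concept code -> target concept code, built from
    # entries shared by both thesauri.
    assigned, unassigned = {}, []
    for entry, src_code in source_entries.items():
        if src_code in mapping:
            assigned[entry] = mapping[src_code]
        else:
            unassigned.append(entry)  # roughly the residue left for manual assignment
    return assigned, unassigned

print(expand_entries({"yama": "S001", "umi": "S999"}, {"S001": "G042"}))
# ({'yama': 'G042'}, ['umi'])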
Learning Verb Subcategorization from Corpora: Counting Frame Subsets | We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88 % accuracy on unseen parsed text. | |
Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets | The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing of four different taggers on the Slovene MULTEXT-East corpus containing about 100,000 words and 1000 different morphosyntactic tags. Results show, first of all, that training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT tri-gram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words and between 54% and 55% on unknown words. Best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset, and accuracies on these features when training on a reduced tagset. Results show that PoS accuracy is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected. | |
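Per-feature accuracy over a positional tagset of this kind can be computed as in the sketch below; the feature positions and example tags are illustrative, not the exact MULTEXT-East layout for Slovene.

def per_feature_accuracy(gold_tags, predicted_tags, feature_names):
    # Per-feature accuracy for positional tags, where the i-th character of a
    # tag encodes the i-th morphosyntactic feature.
    correct = [0] * len(feature_names)
    total = 0
    for g, p in zip(gold_tags, predicted_tags):
        total += 1
        for i in range(min(len(feature_names), len(g), len(p))):
            if g[i] == p[i]:
                correct[i] += 1
    if total == 0:
        return {}
    return {name: correct[i] / total for i, name in enumerate(feature_names)}

print(per_feature_accuracy(["Ncmsn", "Vmip3"], ["Ncmsa", "Vmip3"],
                           ["PoS", "Type", "Gender", "Number", "Case"]))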
Cross-lingual Interpolation of Speech Recognition Models | A method is proposed for implementing the cross-lingual porting of recognition models for rapid prototyping of speech recognisers in new target languages, specifically when the collection of large speech corpora for training would be economically questionable. The paper describes a way to build up a multilingual model which includes the phonetic structure of all the constituent languages, and which can be exploited to interpolate the recognition units of a different language. The CTSU (Classes of Transitory-Stationary Units) approach is exploited to derive a well balanced set of recognition models, as a reasonable trade-off between precision and trainability. The phonemes of the untrained language are then mapped onto the multilingual inventory of recognition units, and the corresponding CTSUs are then obtained. The procedure was tested with a preliminary set of 10 Rumanian speakers starting from an Italian-English-Spanish CTSU model. The optimal mapping of the vowel phone set of this language onto the multilingual phone set was obtained by inspecting the F1 and F2 formants of the vowel sounds from two male and female Rumanian speakers, and by comparing them with the values of F1 and F2 of the other three languages. Results in terms of recognition word accuracy measured on a preliminary test set of 10 speakers are reported. | |
Lexicalised Systematic Polysemy in WordNet | This paper describes an attempt to gain more insight into the mechanisms that underlie lexicalised systematic polysemy. The phenomenon is interpreted as systematic sense combinations that are valid for more than one word. WordNet is exploited to create a working definition of systematic polysemy and to extract polysemic patterns, allowing the identification of fine-grained semantic relations between the senses of the words participating in a systematic polysemic pattern. | |
Experiences of Language Engineering Algorithm Reuse | Traditionally, the level of reusability of language processing resources within the research community has been very low. Most of the recycling of linguistic resources has been concerned with reuse of data, e.g., corpora, lexica, and grammars, while the algorithmic resources far too seldom have been shared between different projects and institutions. As a consequence, researchers who are willing to reuse somebody else's processing components have been forced to invest major efforts into issues of integration, inter-process communication, and interface design. In this paper, we discuss the experiences drawn from the svensk project regarding the issues on reusability of language engineering software as well as some of the challenges for the research community which are prompted by them. Their main characteristics can be laid out along three dimensions: technical/software challenges, linguistic challenges, and `political' challenges. In the end, the unavoidable conclusion is that it definitely is time to bring more aspects of engineering into the Computational Linguistic community! | |
Derivation in the Czech National Corpus | The aim of this paper is to describe one of the main means of Czech word formation - derivation. New Czech words are created by composition or by derivation (by using prefixes or suffixes). The suffixes, which are added to the stem, are used much more frequently than prefixes, which stand before the stem. The most frequent suffixes will be classified according to their paradigmatic and semantic properties and according to the changes they cause in the stem. The research is done on the Czech National Corpus (CNC); the frequencies of the investigated suffixes illustrate their productivity in present-day Czech. This research is of particular value for a highly inflected language such as Czech. Possible applications of this system are various NLP systems, e.g. spelling checkers and machine translation systems. The results of this work serve for the computational processing of Czech word formation and, in the future, for the creation of a Czech derivational dictionary. | |
Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers | This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample. | |
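A schematic view of the feature construction for the second-level learner follows; the concrete feature set used by COMBI-BOOTSTRAP may differ, and the names below are invented.

def stack_features(token, tagger_outputs, lexicon):
    # tagger_outputs: dict tagger_name -> tag proposed for this token by an
    # existing tagger; lexicon: dict word -> set of possible tags in an
    # existing lexical resource.  The resulting feature dict is what the
    # second-level module would be trained on, using a small hand-annotated
    # sample labeled with the new tagset.
    features = {f"tag_{name}": tag for name, tag in tagger_outputs.items()}
    features["in_lexicon"] = token.lower() in lexicon
    features["lex_tags"] = "|".join(sorted(lexicon.get(token.lower(), [])))
    return features

print(stack_features("Amsterdam",
                     {"taggerA": "N(eigen)", "taggerB": "NOUN"},
                     {"amsterdam": {"N(eigen)"}}))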
The Context (not only) for Humans | Our context considerations will be practically oriented; we will explore the specification of a context scope in the Czech morphological tagging. We mean by morphological tagging/annotation the automatic/manual disambiguation of the output of morphological analysis. The Prague Dependency Treebank (PDT) serves as a source of annotated data. The main aim is to concentrate on the evaluation of the influence of the chosen context on the tagging accuracy. | |
Something Borrowed, Something Blue: Rule-based Combination of POS Taggers | Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part of speech (POS) taggers can be combined to better cope with text material of a type on which they were not trained, and for which there are no readily available training corpora. We indicate, using freely available taggers for German (although the method we describe is not language-dependent), how such taggers can be combined by using linguistically motivated rules so that the tagging accuracy of the combination exceeds that of the best of the individual taggers. | |
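In outline, such a rule-based combination reduces to something like the sketch below; the rules shown are invented placeholders, whereas the paper uses linguistically motivated rules tailored to its German taggers.

def combine_tags(word, tag_a, tag_b):
    # Combine the outputs of two off-the-shelf taggers for one token with
    # simple hand-written rules.
    if tag_a == tag_b:
        return tag_a
    # Example placeholder rule: trust tagger A on finite verbs, tagger B elsewhere.
    if tag_a.startswith("VV"):
        return tag_a
    return tag_b

print(combine_tags("geht", "VVFIN", "ADJ"))  # VVFIN (the rule prefers A's verb tag)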
Screffva: A Lexicographer's Workbench | This paper describes the implementation of Screffva, a computer system written in Prolog that employs a parallel corpus for the automatic generation of bilingual dictionary entries. Screffva provides a lemmatised interface between a parallel corpus and its bilingual dictionary. The system has been trialled with a parallel corpus of Cornish-English bitext. Screffva is able to retrieve any given segment of text, and uniquely identifies lexemes and the equivalences that exist between the lexical items in a bitext. Furthermore the system is able to cope with discontinuous multiword lexemes. The system is thus able to find glosses for individual lexical items or to produce longer lexical entries which include part-of-speech, glosses and example sentences from the corpus. The corpus is converted to a Prolog text database and lemmatised. Equivalents are then aligned. Finally Prolog predicates are defined for the retrieval of glosses, part-of-speech and example sentences to illustrate usage. Lexemes, including discontinuous multiword lexemes, are uniquely identified by the system and indexed to their respective segments of the corpus. Insofar as the system is able to identify specific translation equivalents in the bitext, the system provides a much more powerful research tool than existing concordancers such as ParaConc, WordSmith, XCorpus and Multiconcord. The system is able to automatically generate a bilingual dictionary which can be exported and used as the basis for a paper dictionary. Alternatively the system can be used directly as an electronic bilingual dictionary. | |
A Step toward Semantic Indexing of an Encyclopedic Corpus | This paper investigates a method for extracting and acquiring knowledge from linguistic resources. In particular, we propose an NLP-based architecture for building a semantic network out of an XML online encyclopedic corpus. The general application underlying this work is a question-answering system on proper nouns within an encyclopedia. | |
Issues in the Evaluation of Spoken Dialogue Systems - Experience from the ACCeSS Project | We describe the framework and present detailed results of an evaluation of 1,500 dialogues recorded during a three-month field trial of the ACCeSS Dialogue System. The system routed incoming calls to agents of a call center and handled about 100 calls per day. | |
Evaluating Summaries for Multiple Documents in an Interactive Environment | While most people have a clear idea of what a single-document summary should look like, this is not immediately obvious for a multi-document summary. There are many new questions to answer concerning the number of documents to be summarized, the type of documents, the kind of summary that should be generated, the way the summary gets presented to the user, etc. The many possible approaches to multi-document summarization make evaluation especially difficult. In this paper we describe an approach to multi-document summarization and report work on an evaluation method for this particular system. | |
Grammarless Bracketing in an Aligned Bilingual Corpus | We propose a simple grammarless procedure to extract phrasal examples from aligned parallel texts. It is based on the differences in word sequence between the two languages. | |
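The idea can be illustrated with the following sketch, which keeps a source span as a phrasal example whenever its aligned target positions also form an unbroken block; 1-to-1 word links are assumed here for simplicity, which is an idealisation of the real setting.

def contiguous_phrases(alignment, max_len=4):
    # alignment: dict source_idx -> target_idx for one aligned sentence pair.
    phrases = []
    sources = sorted(alignment)
    for i in range(len(sources)):
        for j in range(i + 1, min(i + max_len, len(sources))):
            span = sources[i:j + 1]
            # source positions must be contiguous ...
            if span != list(range(span[0], span[-1] + 1)):
                continue
            # ... and so must their images in the target sentence
            targets = sorted(alignment[s] for s in span)
            if targets == list(range(targets[0], targets[-1] + 1)):
                phrases.append((span[0], span[-1]))
    return phrases

# Spans whose translations stay together (despite reordering across the two
# languages) are bracketed as phrasal examples.
print(contiguous_phrases({0: 2, 1: 3, 2: 0, 3: 1}))  # [(0, 1), (0, 3), (2, 3)]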
A Semi-automatic System for Conceptual Annotation, its Application to Resource Construction and Evaluation | The CONCERTO project, primarily concerned with the annotation of texts for their conceptual content, combines automatic linguistic analysis with manual annotation to ensure the accuracy of fact extraction, and to encode content in a rich knowledge representation framework. The system provides annotation tools, automatic multi-level linguistic analysis modules, a partial parsing formalism with a more user-friendly language than standard regular expression languages, XML-based document management, and a powerful knowledge representation and query facility. We describe the architecture and functionality of the system, how it can be adapted for a range of resource construction tasks, and how it can be configured to compute statistics on the accuracy of its automatic analysis components. | |
The MATE Workbench Annotation Tool, a Technical Description | The MATE workbench is a tool which aims to simplify the tasks of annotating, displaying and querying speech or text corpora. It is designed to help humans create language resources, and to make it easier for different groups to use one another’s data, by providing one tool which can be used with many different annotation schemes. Any annotation scheme which can be converted to XML can be used with the workbench, and display formats optimised for particular annotation tasks are created using a transformation language similar to XSLT. The workbench is written entirely in Java, which means that it is platform-independent. | |
Recruitment Techniques for Minority Language Speech Databases: Some Observations | This paper describes the collection efforts for SpeechDat Cymru, a 2000-speaker database for Welsh, a minority language spoken by about 500,000 of the Welsh population. The database is part of the SpeechDat(II) project. General database details are discussed insofar as they affect recruitment strategies, and likely differences between minority language spoken language resource (SLR) and general SLR collection are noted. Individual recruitment techniques are then detailed, with an indication of their relative successes and relevance to minority language SLR collection generally. It is observed that no one technique was sufficient to collect the entire database, and that those techniques involving face-to-face recruitment by an individual closely involved with the database collection produced the best yields for effort expended. More traditional postal recruitment techniques were less successful. The experiences during collection underlined the importance of utilising enthusiastic recruiters, and taking advantage of the speaker networks present in the community. | |
Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation | Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success. | |
PoS Disambiguation and Partial Parsing Bidirectional Interaction | This paper presents Latch, a system for PoS disambiguation and partial parsing that has been developed for Spanish. In this system, chunks can be recognized and referred to like ordinary words in the disambiguation process. This way, sentences are simplified so that the disambiguator can operate interpreting a chunk as a word and chunk head information as a word analysis. This interaction of PoS disambiguation and partial parsing considerably reduces the effort needed for writing rules. Furthermore, the methodology we propose improves both efficiency and results. | |
Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis | This paper presents a taxonomy of previous work on infrastructures, architectures and development environments for representing and processing Language Resources (LRs), corpora, and annotations. This classification is then used to derive a set of requirements for a Software Architecture for Language Engineering (SALE). The analysis shows that a SALE should address common problems and support typical activities in the development, deployment, and maintenance of LE software. The results will be used in the next phase of construction of an infrastructure for LR production, distribution, and access. | |
XCES: An XML-based Encoding Standard for Linguistic Corpora | The Corpus Encoding Standard (CES) is a part of the EAGLES Guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES) that provides a set of encoding standards for corpus-based work in natural language processing applications. We have instantiated the CES as an XML application called XCES, based on the same data architecture comprised of a primary encoded text and ''standoff'' annotation in separate documents. Conversion to XML enables use of some of the more powerful mechanisms provided in the XML framework, including the XSLT Transformation Language, XML Schemas, and support for inter-document reference together with an extensive path syntax for pointers. In this paper, we describe the differences between the CES and XCES DTDs and demonstrate how XML mechanisms can be used to select from and manipulate annotated corpora encoded according to XCES specifications. We also provide a general overview of XML and the XML mechanisms that are most relevant to language engineering research and applications. | |
Named Entity Recognition in Greek Texts | In this paper, we describe work in progress for the development of a named entity recognizer for Greek. The system aims at information extraction applications where large scale text processing is needed. Speed of analysis, system robustness, and results accuracy have been the basic guidelines for the system’s design. Our system is an automated pipeline of linguistic components for Greek text processing based on pattern matching techniques. Non-recursive regular expressions have been implemented on top of it in order to capture different types of named entities. For development and testing purposes, we collected a corpus of financial texts from several web sources and manually annotated part of it. Overall precision and recall are 86% and 81% respectively. | |
A Robust Parser for Unrestricted Greek Text | In this paper we describe a method for the efficient parsing of real-life Greek texts at the surface syntactic level. A grammar consisting of non-recursive regular expressions describing Greek phrase structure has been compiled into a cascade of finite state transducers used to recognize syntactic constituents. The implemented parser lends itself to applications where large-scale text processing is involved, and fast, robust, and relatively accurate syntactic analysis is necessary. The parser has been evaluated against a ca. 34,000-word corpus of financial and news texts and achieved promising precision and recall scores. | |
A Computational Platform for Development of Morphologic and Phonetic Lexica | Statistical approaches in speech technology, whether based on statistical language models, trees, hidden Markov models or neural networks, are the driving force behind the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents a system architecture for the rapid construction of morphological and phonetic lexica for the Slovenian language. The integrated graphical user interface focuses on the morphological and phonetic aspects of Slovenian and allows experts to work efficiently during analysis. | |
An Open Architecture for the Construction and Administration of Corpora | The use of language corpora for a variety of purposes has increased significantly in recent years. General corpora are now available for many languages, but research often requires more specialized corpora. The rapid development of the World Wide Web has greatly improved access to data in electronic form, but research has tended to focus on corpus annotation, rather than on corpus building tools. Therefore many researchers are building their own corpora, solving problems independently, and producing project-specific systems which cannot easily be re-used. This paper proposes an open client-server architecture which can service the basic operations needed in the construction and administration of corpora, but allows customisation by users in order to carry out project-specific tasks. The paper is based partly on recent practical experience of building a corpus of 10 million words of Written Business English from webpages, in a project which was co-funded by ELRA and the University of Wolverhampton. | |
Design of Optimal Slovenian Speech Corpus for Use in the Concatenative Speech Synthesis System | This paper presents the development of a Slovenian speech corpus for use in the concatenative speech synthesis system being developed at the University of Maribor, Slovenia. The emphasis of the paper is on maximising the usefulness of the speech corpus for concatenation purposes. The usefulness of the speech corpus depends heavily on the corresponding text and can be increased if appropriate text is chosen. In our approach, detailed statistics over the text corpora were computed in order to select sentences rich in non-uniform units such as monophones, diphones and triphones. | |
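As an illustration of the coverage-driven text selection described above, the following sketch greedily picks sentences that add the most not-yet-covered phone n-grams (e.g. diphones or triphones); the function names and the n-gram notion of "unit" are assumptions for illustration, not the paper's actual procedure.

```python
def units(sentence, n=2):
    """Extract overlapping phone n-grams (e.g. diphones) from a phone sequence."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]

def greedy_select(sentences, n=2, budget=500):
    """Greedily pick sentences that add the most units not yet covered."""
    covered, selected = set(), []
    for _ in range(budget):
        best, best_gain = None, 0
        for s in sentences:
            gain = len(set(units(s, n)) - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # nothing left that adds new units
            break
        selected.append(best)
        covered |= set(units(best, n))
        sentences.remove(best)
    return selected

# usage: sentences given as lists of phone symbols
corpus = [["m", "i", "z", "a"], ["z", "a", "m", "i"], ["m", "i", "z", "a"]]
print(greedy_select(corpus, n=2))
```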
CLinkA: A Coreferential Links Annotator | The annotation of coreferential chains in a text is a difficult task, which requires a lot of concentration. Given its complexity, without an appropriate tool it is very difficult to produce high-quality coreferentially annotated corpora. In this paper we discuss the requirements for developing a tool to help the human annotator in this task. The annotation scheme used by our program is derived from the one proposed for the MUC-7 Coreference Task, but is not restricted to it. Using a very simple language, the user is able to define his/her own annotation scheme. The tool has a user-friendly interface and is language and platform independent. | |
What's in a Thesaurus? | We first describe four varieties of thesaurus: (1) Roget-style, produced to help people find synonyms when they are writing; (2) WordNet and EuroWordNet; (3) thesauruses produced (manually) to support information retrieval systems; and (4) thesauruses produced automatically from corpora. We then contrast thesauruses and dictionaries, and present a small experiment in which we look at polysemy in relation to thesaurus structure. It has sometimes been assumed that different dictionary senses for a word that are close in meaning will be near neighbours in the thesaurus. This hypothesis is explored, using as inputs the hierarchical structure of WordNet 1.5 and a mapping between WordNet senses and the senses of another dictionary. The experiment shows that pairs of ‘lexicographically close’ meanings are frequently found in different parts of the hierarchy. | |
A Unified POS Tagging Architecture and its Application to Greek | This paper proposes a flexible and unified tagging architecture that could be incorporated into a number of applications like information extraction, cross-language information retrieval, term extraction, or summarization, while providing an essential component for subsequent syntactic processing or lexicographical work. A feature-based multi-tiered approach (FBT tagger) is introduced to part-of-speech tagging. FBT is a variant of the well-known transformation based learning paradigm aiming at improving the quality of tagging highly inflective languages such as Greek. Additionally, a large experiment concerning the Greek language is conducted and results are presented for a variety of text genres, including financial reports, newswires, press releases and technical manuals. Finally, the adopted evaluation methodology is discussed. | |
Resources for Lexicalized Tree Adjoining Grammars and XML Encoding: TagML | This work addresses both practical and theoretical purposes for the encoding and exploitation of linguistic resources for feature-based Lexicalized Tree Adjoining Grammars (LTAG). The main goals of these specifications are the following: 1. Define a recommendation, by way of an XML (Bray et al., 1998) DTD or schema (Fallside, 2000), for encoding LTAG resources in order to exchange grammars, share tools and compare parsers. 2. Exploit XML, its features and the related recommendations for the representation of complex and redundant linguistic structures, based on a general methodology. 3. Study the resource organisation and the level of generalisation which are relevant for a lexicalized tree grammar. | |
Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers | We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types). | |
ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation | We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on “Annotation Graphs,” a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic “signals,” including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture. | |
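As an illustration of the annotation graph model referred to above, the sketch below shows a minimal graph of time-anchored nodes and labelled arcs; the class and field names are invented for illustration and are not the ATLAS API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    id: str
    offset: Optional[float] = None  # anchor into the signal (e.g. seconds); may be absent

@dataclass
class Arc:
    src: str
    dst: str
    layer: str   # annotation layer, e.g. "word", "phone", "speaker"
    label: str

@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add_node(self, node_id, offset=None):
        self.nodes[node_id] = Node(node_id, offset)

    def annotate(self, src, dst, layer, label):
        self.arcs.append(Arc(src, dst, layer, label))

# usage: a two-word utterance anchored to the audio timeline
g = AnnotationGraph()
g.add_node("n0", 0.00); g.add_node("n1", 0.31); g.add_node("n2", 0.58)
g.annotate("n0", "n1", "word", "hello")
g.annotate("n1", "n2", "word", "world")
print(len(g.arcs), "arcs over", len(g.nodes), "nodes")
```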
Models of Russian Text/Speech Interactive Databases for Supporting of Scientific, Practical and Cultural Researches | The paper briefly describes the following databases: ”Online Sound Archives from St. Petersburg Collections”, ”Regional Variants of the Russian Speech”, and ”Multimedia Dictionaries of the minor Languages of Russia”, the principal feature of which is built-in support for scientific, practical and cultural research. Although these databases are addressed mainly to researchers in Spoken Language Processing, so that their main object is sound, the proposed database ideology and general approach to text/speech data representation and access may also be used to build various language resources containing text, audio and video data. Such an approach requires a special representation of the database material. Thus, all text and sound files should be accompanied by information on their multi-level segmentation, which allows the user to extract and analyze any segment of text or speech. Each significant segment of the database should be treated as a potential object of investigation and should be supplied with tables of descriptive parameters mirroring its various characteristics. The list of these parameters for all potential objects is open to further extension. | |
Some Technical Aspects about Aligning Near Languages | IULA at UPF has developed an aligner that benefits from corpus processing results to produce an accurate and robust alignment, even with noisy parallel corpora. It compares lemmata and part-of-speech tags of analysed texts, but it has two main characteristics: first, it apparently only works for near languages, and second, it requires morphological taggers for the languages compared. These two characteristics prevent the technique from being used for arbitrary pairs of languages. Whenever it is applicable, however, high-quality results are achieved. | |
Corpus Resources and Minority Language Engineering | Low density languages are typically viewed as those for which few language resources are available. Work relating to low density languages is becoming a focus of increasing attention within language engineering (e.g. Charoenporn, 1997, Hall and Hudson, 1997, Somers, 1997, Nirenberg and Raskin, 1998, Somers, 1998). However, much work related to low density languages is still in its infancy, or worse, work is blocked because the resources needed by language engineers are not available. In response to this situation, the MILLE (Minority Language Engineering) project was established by the Engineering and Physical Sciences Research Council (EPSRC) in the UK to discover what language corpora should be built to enable language engineering work on non-indigenous minority languages in the UK, most of which are typically low-density languages. This paper summarises some of the major findings of the MILLE project. | |
CDB - A Database of Lexical Collocations | CDB is a relational database designed for the particular needs of representing lexical collocations. The relational model is defined such that competence-based descriptions of collocations (the competence base) and actually occurring collocation examples extracted from text corpora (the example base) complete each other. In the paper, the relational model is described and examples for the representation of German PP-verb collocations are given. A number of example queries are presented, and additional facilities which are built on top of the database are discussed. | |
Evaluation for Darpa Communicator Spoken Dialogue Systems | The overall objective of the DARPA COMMUNICATOR project is to support rapid, cost-effective development of multi-modal speech-enabled dialogue systems with advanced conversational capabilities, such as plan optimization, explanation and negotiation. In order to make this a reality, we need to find methods for evaluating the contribution of various techniques to the users’ willingness and ability to use the system. This paper reports on the approach to spoken dialogue system evaluation that we are applying in the COMMUNICATOR program. We describe our overall approach, the experimental design, the logfile standard, and the metrics applied in the experimental evaluation planned for June of 2000. | |
Transcribing with Annotation Graphs | Transcriber is a tool for manual annotation of large speech files. It was originally designed for the broadcast news transcription task. The annotation file format was derived from previous formats used for this task, and many related features were hard-coded. In this paper we present a generalization of the tool based on the annotation graph formalism, and on a more modular design. This will allow us to address new tasks, while retaining Transcriber’s simple, crisp user-interface which is critical for user acceptance. | |
Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results | We are annotating a corpus with information relevant to discourse entity realization, and especially the information needed to decide which type of NP to use. The corpus is being used to study correlations between NP type and certain semantic or discourse features, to evaluate hand-coded algorithms, and to train statistical models. We report on the development of our annotation scheme, the problems we have encountered, and the results obtained so far. | |
Towards a Query Language for Annotation Graphs | The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation. | |
The American National Corpus: A Standardized Resource for American English | At the first conference on Language Resources and Evaluation, Granada 1998, Charles Fillmore, Nancy Ide, Daniel Jurafsky, and Catherine Macleod proposed creating an American National Corpus (ANC) that would compare with the British National Corpus (BNC) both in balance and in size (one hundred million words). This paper reports on the progress made over the past two years in launching the project. At present, the ANC project is well underway, with commitments for support and contribution of texts from a number of publishers world-wide. | |
Semantic Tagging for the Penn Treebank | This paper describes the methodology that is being used to augment the Penn Treebank annotation with sense tags and other types of semantic information. Inspired by the results of SENSEVAL, and the high inter-annotator agreement that was achieved there, similar methods were used for a pilot study of 5000 words of running text from the Penn Treebank. Using the same techniques of allowing the annotators to discuss difficult tagging cases and to revise WordNet entries if necessary, comparable inter-annotator rates have been achieved. The criteria for determining appropriate revisions and ensuring clear sense distinctions are described. We are also using hand correction of automatic predicate argument structure information to provide additional thematic role labeling. | |
Rule-based Tagging: Morphological Tagset versus Tagset of Analytical Functions | This work presents part of a broader study on the problem of parsing Czech and on the knowledge-extraction capabilities of the rule-based method. It is shown that the success of the rule-based method for English and its failure for Czech is due not only to the small cardinality of the English tagset (as is usually claimed) but mainly to its structure (the ”regularity” of the language information). | |
The (Un)Deterministic Nature of Morphological Context | The aim of this paper is to contribute to the study of context within natural language processing and to bring in aspects which, I believe, have a direct influence on the interpretation of success rates and on a more successful design of language models. This work tries to formalize the (ir)regularities and dynamic characteristics of context using techniques from the field of chaotic and non-linear systems. The observations are made on the problem of POS tagging. | |
A Framework for Cross-Document Annotation | We introduce a cross-document annotation toolset that serves as a corpus-wide knowledge base for linguistic annotations. This implemented system is designed to address the unique cognitive demands placed on human annotators who must relate information that is expressed across document boundaries. | |
Extraction of Concepts and Multilingual Information Schemes from French and English Economics Documents | This paper focuses on the linguistic analysis of economic information in French and English documents. Our objective is to establish domain-specific information schemes based on structural and conceptual information. At the structural level, we define linguistic triggers that take into account each language's specificity. At the conceptual level, the analysis of concepts and of relations between concepts results in a classification, prior to the representation of schemes. The final outcome of this study is a mapping between linguistic and conceptual structures in the field of economics. | |
How to Evaluate Your Question Answering System Every Day ... and Still Get Real Work Done | In this paper, we report on Qaviar, an experimental automated evaluation system for question answering applications. The goal of our research was to find an automatically calculated measure that correlates well with human judges' assessment of answer correctness in the context of question answering tasks. Qaviar judges a response by computing recall against the stemmed content words in the human-generated answer key, and counts the answer as correct if it exceeds a given recall threshold. We determined that the answer correctness predicted by Qaviar agreed with human judgements 93% to 95% of the time. Forty-one question-answering systems were ranked by both Qaviar and human assessors, and these rankings correlated with a Kendall’s Tau measure of 0.920, compared to a correlation of 0.956 between human assessors on the same data. | |
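A minimal sketch of the recall-threshold judgement described above follows; the toy stemmer, stopword list and threshold value are assumptions for illustration, not Qaviar's actual components.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "was"}

def stem(word):
    """Crude suffix-stripping stemmer, used only for illustration."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def content_stems(text):
    """Lower-cased, stopword-filtered, stemmed content words of a string."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}

def judge(response, answer_key, threshold=0.5):
    """Count the response as correct if its recall of the key's stems exceeds the threshold."""
    key = content_stems(answer_key)
    if not key:
        return False
    recall = len(key & content_stems(response)) / len(key)
    return recall >= threshold

# usage
print(judge("He was born in Hodgenville, Kentucky in 1809",
            "Abraham Lincoln was born in Hodgenville, Kentucky"))
```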
What are Transcription Errors and Why are They made? | In recent work we compared transcriptions of German spontaneous dialogues from the VERBMOBIL corpus to ascertain differences between transcribers and to assess transcription quality. A better understanding of where and what kinds of inconsistencies occur will help us to improve the working environment for transcribers, to reduce the effort spent on correction passes, and will ultimately result in better transcription quality. The results show that transcribers have different levels of perception of spontaneous speech phenomena, mainly prosodic phenomena such as pauses in speech and lengthening. During the correction pass, 80% of these labels had to be inserted. Additionally, the annotation of non-grammatical phrases and pronunciation comments seems to need a better explanation in the convention manual; here the correcting transcribers had to change 20% of the annotations. | |
On the Usage of Kappa to Evaluate Agreement on Coding Tasks | In recent years, the Kappa coefficient of agreement has become the de facto standard to evaluate intercoder agreement in the discourse and dialogue processing community. Together with the adoption of this standard, researchers have adopted one specific scale to evaluate Kappa values, the one proposed in (Krippendorff, 1980). In this paper, I highlight some issues that should be taken into account when evaluating Kappa values. Finally, I speculate on whether Kappa could be used as a measure to evaluate a system’s performance. | |
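For reference, the sketch below computes the standard two-coder Kappa (observed agreement corrected for chance agreement) that the paper discusses; the toy labels are invented for illustration.

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Two-coder Kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (p_observed - p_chance) / (1 - p_chance)

# usage: two coders labelling ten dialogue segments as question (q) or statement (s)
coder_a = ["q", "q", "s", "s", "q", "s", "q", "q", "s", "q"]
coder_b = ["q", "s", "s", "s", "q", "s", "q", "q", "q", "q"]
print(round(kappa(coder_a, coder_b), 3))
```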
Automatic Extraction of English-Chinese Term Lexicons from Noisy Bilingual Corpora | This paper describes our system, which is designed to extract English-Chinese term lexicons from noisy, complex bilingual corpora and use them as a translation lexicon to check sentence alignment results. The noisy bilingual corpora are first aligned by our improved length-based statistical approach, which can partly detect sentence omissions and insertions. A term extraction system is then used to obtain term translation lexicons from the roughly aligned corpora, after which the statistical approach is used to align the corpora again. Finally, we filter the noisy bilingual texts and obtain nearly perfectly aligned corpora. | |
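A minimal sketch of the length-based scoring idea behind such aligners (in the spirit of Gale and Church) is given below; it omits the dynamic programming over insertion and omission patterns and is not the authors' improved method, and the parameter values are illustrative assumptions.

```python
import math

# Expected character-length ratio between the two languages and its variance;
# the values here are illustrative assumptions, not estimated from data.
MEAN_RATIO = 1.0
VARIANCE = 6.8

def length_cost(len_src, len_tgt):
    """Cost of aligning two segments based only on their lengths (smaller is better)."""
    if len_src + len_tgt == 0:
        return 0.0
    delta = (len_tgt - len_src * MEAN_RATIO) / math.sqrt((len_src + len_tgt) * VARIANCE)
    return delta * delta

def best_match(src_sentence, tgt_sentences):
    """Pick the target sentence whose length best matches the source sentence."""
    return min(tgt_sentences, key=lambda t: length_cost(len(src_sentence), len(t)))

# usage
print(best_match("a fairly long English sentence here",
                 ["short", "a target of roughly comparable length", "x"]))
```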
Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium | The Linguistic Data Consortium (LDC) is a non-profit consortium of universities, companies and government research laboratories that supports education, research and technology development in language-related disciplines by collecting or creating, distributing and archiving language resources, including data and accompanying tools, standards and formats. LDC was founded in 1992 with a grant from the Defense Advanced Research Projects Agency (DARPA) to the University of Pennsylvania as host organization. LDC's publication and distribution activities are self-supporting through membership fees and data sales, while new data creation is supported primarily by grants from DARPA and the National Science Foundation. Recent developments in the creation and use of language resources demand new roles for international data centers. Since our report at the last Language Resources and Evaluation Conference in Granada in 1998, LDC has observed growth in the demand for language resources along multiple dimensions: larger corpora with more sophisticated annotation in a wider variety of languages are used in an increasing number of language-related disciplines. There is also increased demand for reuse of existing corpora. Most significantly, small research groups are taking advantage of advances in microprocessor technology, data storage and internetworking to create their own corpora. This has led to the birth of new annotation practices whose very variety creates barriers to data sharing. This paper will describe recent LDC efforts to address emerging issues in the creation and distribution of language resources. | |
Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts | This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segmentation of the news streams into individual stories, detection of new topics, identification of the first story to discuss any topic, tracking of all stories on selected topics and detection of links among stories discussing the same topics. The corpora contain English and Chinese broadcast television and radio, newswires, and text from web sites devoted to news. For each source there are texts or text intermediaries; for the broadcast stories the audio is also available. Each broadcast is also segmented to show start and end times of all news stories. LDC staff have defined news topics in the corpora and annotated each story to indicate its relevance to each topic. The end products are massive, richly annotated corpora available to support research and development in information retrieval, topic detection and tracking, information extraction and message understanding, directly or after additional annotation. This paper will describe the corpora created for TDT, including sources, collection processes, formats, topic selection and definition, annotation, distribution and project management for large corpora. | |
Using Machine Learning Methods to Improve Quality of Tagged Corpora and Learning Models | Corpus-based learning methods for natural language processing now provide a consistent way to achieve systems with good performance. A number of statistical learning models have been proposed and are used in most of the tasks which used to be handled by rule-based systems. When learning systems reach a level competitive with manually constructed systems, both large-scale training corpora and good learning models are of great importance. In this paper, we first argue that the main hindrances to the improvement of corpus-based learning systems are the inconsistencies or errors in the training corpus and deficiencies in the learning model. We then show that some machine learning methods are useful for effectively identifying erroneous sources in the training corpus. Finally, we discuss how the various types of errors should be dealt with so as to improve the learning environment. | |
Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora | The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television and radio sources from English and Mandarin Chinese. The most recent TDT corpus, TDT3, added two tasks, story link and first story detection. Annotation of the TDT corpora involved a large staff of annotators who produced millions of human judgements. As with any large corpus creation effort, quality assurance and inter-annotator consistency were a major concern. This paper reports the quality control measures adopted by the LDC during the creation of the TDT corpora, presents techniques that were utilized to evaluate and improve the consistency of human annotators for all annotation tasks, and discusses aspects of project administration that were designed to enhance annotation consistency. | |
Learning Preference of Dependency between Japanese Subordinate Clauses and its Evaluation in Parsing | (Utsuro et al., 2000) proposed a statistical method for learning the dependency preference of Japanese subordinate clauses, in which the scope-embedding preference of subordinate clauses is exploited as a useful information source for disambiguating dependencies between subordinate clauses. Following (Utsuro et al., 2000), this paper presents detailed results of evaluating the proposed method, comparing it with several closely related existing techniques and showing that the proposed method outperforms them. | |
Live Lexicons and Dynamic Corpora Adapted to the Network Resources for Chinese Spoken Language Processing Applications in an Internet Era | In the future network era, huge volumes of information on all subject domains will be readily available via the network. Moreover, network information is dynamic, ever-changing and exploding. Furthermore, many spoken language processing applications will have to deal with the content of this network information, which is dynamic. This means dynamic lexicons, language models and so on will be required. In order to cope with such a new network environment, automatic approaches to the collection, classification, indexing, organization and utilization of the linguistic data obtainable from the networks for language processing applications will be very important. On the one hand, high-performance spoken language technology can hopefully be developed based on such dynamic linguistic data on the network. On the other hand, it is also necessary that such spoken language technology can be intelligently adapted to the content of the dynamic and ever-changing network information. Some basic concepts for live lexicons and dynamic corpora adapted to network resources have been developed for Chinese spoken language processing applications and are briefly summarized in this paper. Although the major considerations here are for the Chinese language, the concepts may equally apply to other languages. | |
Lessons Learned from a Task-based Evaluation of Speech-to-Speech Machine Translation | For several years we have been conducting Accuracy Based Evaluations (ABE) of the JANUS speech-to-speech MT system (Gates et al., 1997) which measure quality and fidelity of translation. Recently we have begun to design a Task Based Evaluation for JANUS (Thomas, 1999) which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation. Both evaluations (ABE and TBE) were conducted on a common set of user studies in the semantic domain of travel planning. | |
Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus | This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high quality implementation of a Hidden Markov Model tagger generator. | |
The Influence of Scenario Constraints on the Spontaneity of Speech. A Comparison of Dialogue Corpora | In this article we compare two large scale dialogue corpora recorded in different settings. The main differences are unrestricted turn-taking vs. push-to-talk button and complex vs. simple negotiation task. In our investigation we found that vocabulary, durations of turns, words and sounds as well as prosodical features are influenced by differences in the setting. | |
Automatic Assignment of Grammatical Relations | This paper presents a method for the assignment of grammatical relation labels in a sentence structure. The method has been implemented in the software tool AGRA (Automatic Grammatical Relation Assigner), which is part of a project for the development of a treebank of Italian sentences, and a knowledge base of Italian subcategorization frames. The annotation schema implements a notion of underspecification that arranges grammatical relations from generic to specific in a hierarchy; the software tool works with hand-coded rules, which apply heuristic knowledge (syntactic and semantic cues) to distinguish between complements and modifiers. | |
Integrating Subject Field Codes into WordNet | In this paper, we present a lexical resource where WordNet synsets are annotated with Subject Field Codes. We discuss both the methodological issues we dealt with and the annotation techniques used. A quantitative analysis of the resource coverage, as well as a qualitative evaluation of the proposed annotations, are reported. | |
Building a Treebank for Italian: a Data-driven Annotation Schema | Many natural language researchers are currently turning their attention to treebank development and trying to achieve accuracy and corpus data coverage in their representation formats. This paper presents a data-driven annotation schema developed for an Italian treebank ensuring data coverage and consistency between annotation of linguistic phenomena. The schema is a dependency-based format centered upon the notion of predicate-argument structure augmented with traces to represent discontinuous constituents. The treebank development involves an annotation process performed by a human annotator helped by an interactive parsing tool that builds incrementally syntactic representation of the sentence. To increase the syntactic knowledge of this parser, a specific data-driven strategy has been applied. We describe the cyclical development of the annotation schema highlighting the richness and flexibility of the format, and we present some representational issues. | |
Typographical and Orthographical Spelling Error Correction | This paper focuses on selection techniques for best correction of misspelt words at the lexical level. Spelling errors are introduced by either cognitive or typographical mistakes. A robust spelling correction algorithm is needed to cover both cognitive and typographical errors. For the most effective spelling correction system, various strategies are considered in this paper: ranking heuristics, correction algorithms, and correction priority strategies for the best selection. The strategies also take account of error types, syntactic information, word frequency statistics, and character distance. The findings show that it is very hard to generalise the spelling correction strategy for various types of data sets such as typographical, orthographical, and scanning errors. | |
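As an illustration of ranking candidate corrections by character distance and word frequency, two of the heuristics mentioned above, the following sketch uses a plain Levenshtein distance; the toy lexicon and the tie-breaking scheme are assumptions for illustration, not the paper's actual strategy.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def rank_corrections(misspelling, lexicon_freq, max_dist=2):
    """Rank candidates by character distance, breaking ties by corpus frequency."""
    candidates = [(w, edit_distance(misspelling, w), f)
                  for w, f in lexicon_freq.items()
                  if abs(len(w) - len(misspelling)) <= max_dist]
    candidates = [c for c in candidates if c[1] <= max_dist]
    return sorted(candidates, key=lambda c: (c[1], -c[2]))

# usage: (word, distance, frequency), best candidates first
lexicon = {"receive": 120, "recipe": 45, "relieve": 30}
print(rank_corrections("recieve", lexicon))
```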
Application of WordNet ILR in Czech Word-formation | The aim of this paper is to describe some typical word formation procedures in Czech and to show how the internal language relations (ILR) as they are introduced in Czech WordNet can be related to the chosen derivational processes. In our exploration we have paid attention to the roles of agent, location, instrument and subevent which yield the most regular and rich ways of suffix derivation in Czech. We also deal with the issues of the translation equivalents and corresponding lexical gaps that had to be solved in the framework of EuroWordNet 2 (confronting Czech with English) since they are basically brought about by verb prefixation (single, double, verb aspect pairs) or noun suffixation (diminutives, move in gender). Finally, we try to demonstrate that the mentioned derivational processes can be employed to extend Czech lexical resources in a semiautomatic way. | |
POSCAT: A Morpheme-based Speech Corpus Annotation Tool | As more and more speech systems require linguistic knowledge to accommodate various levels of applications, corpora that are tagged with linguistic annotations as well as signal-level annotations are highly recommended for the development of today’s speech systems. Among the linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech corpora for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. Considering the above demands, we have developed a single unified speech corpus annotation tool that enables corpus builders to link linguistic annotations to signal-level annotations using a morphological analyzer and a POS tagger as basic morpheme-based linguistic engines. Our tool integrates a syntactic analyzer, phrase break detector, grapheme-to-phoneme converter and automatic phonetic aligner together. Each engine automatically annotates its own linguistic and signal knowledge, and interacts with the corpus developers to revise and correct the annotations on demand. All the linguistic/phonetic engines were developed and merged with an interactive visualization tool in a client-server network communication model. The corpora that can be constructed using our annotation tool are multi-purpose and applicable to both speech recognition and text-to-speech (TTS) systems. Finally, since the linguistic and signal processing engines and user interactive visualization tool are implemented within a client-server model, the system loads can be reasonably distributed over several machines. | |
A Flexible Infrastructure for Large Monolingual Corpora | In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed on the basis of a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications of this language resource. A WWW interface allows for public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de). | |
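A minimal sketch of sentence-based collocation extraction of the kind described above follows; the significance measure (pointwise mutual information) and the count threshold are assumptions for illustration, not necessarily those used in the system.

```python
import math
from collections import Counter
from itertools import combinations

def sentence_collocations(sentences, min_count=2):
    """Count how often word pairs co-occur within a sentence and score the pairs
    with pointwise mutual information (chosen here only for illustration)."""
    word_freq, pair_freq = Counter(), Counter()
    for sent in sentences:
        tokens = set(sent)
        word_freq.update(tokens)
        pair_freq.update(frozenset(p) for p in combinations(sorted(tokens), 2))
    n = len(sentences)
    scores = {}
    for pair, c in pair_freq.items():
        if c < min_count:
            continue
        a, b = tuple(pair)
        scores[(a, b)] = math.log((c / n) / ((word_freq[a] / n) * (word_freq[b] / n)))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# usage: sentences given as token lists
corpus = [["der", "schnelle", "zug"], ["der", "zug", "faehrt"], ["schnelle", "zug"]]
print(sentence_collocations(corpus, min_count=2))
```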
Automatic Transliteration and Back-transliteration by Decision Tree Learning | Automatic transliteration and back-transliteration across languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic, English/Chinese, etc., has practical importance in machine translation, cross-lingual information retrieval, and automatic bilingual dictionary compilation. In this paper, a bi-directional and, to some extent, language-independent methodology for English/Korean transliteration and back-transliteration is described. Our method is composed of character alignment and decision tree learning. We induce transliteration rules for each English letter and back-transliteration rules for each Korean letter. For the training of decision trees, a large set of labeled examples of transliteration and back-transliteration is needed; however, such resources are generally not available. Our character alignment algorithm is capable of aligning English words and Korean transliterations highly accurately in the desired way. | |
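A minimal sketch of learning per-letter transliteration decisions from character-aligned pairs follows; the context-window features, the toy aligned examples (romanised output symbols rather than Korean script) and the use of an off-the-shelf decision tree are assumptions for illustration, not the authors' exact setup.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def char_features(word, i, window=2):
    """Features for the i-th character: the character itself plus its neighbours."""
    feats = {"c0": word[i]}
    for k in range(1, window + 1):
        feats[f"c-{k}"] = word[i - k] if i - k >= 0 else "<"
        feats[f"c+{k}"] = word[i + k] if i + k < len(word) else ">"
    return feats

# Toy character-aligned pairs: each English letter mapped to a romanised output
# symbol; real training data would come from the character alignment step.
aligned = [("data", ["d", "e", "t", "a"]), ("radio", ["r", "a", "d", "i", "o"])]

X, y = [], []
for eng, out in aligned:
    for i in range(len(eng)):
        X.append(char_features(eng, i))
        y.append(out[i])

vec = DictVectorizer()
clf = DecisionTreeClassifier().fit(vec.fit_transform(X), y)

word = "dado"
print([clf.predict(vec.transform([char_features(word, i)]))[0] for i in range(len(word))])
```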
Shallow Discourse Genre Annotation in CallHome Spanish | The classification of speech genre is not yet an established task in language technologies. However, we believe that it is a task that will become fairly important as large amounts of audio (and video) data become widely available. The technological capability to easily transmit and store all human interactions in audio and video could have a radical impact on our social structure. The major open question is how this information can be used in practical and beneficial ways. As a first approach to this question we are looking at issues involving information access to databases of human-human interactions. Classification by genre is a first step in the process of retrieving a document out of a large collection. In this paper we introduce a local notion of speech activities that exist side by side in conversations belonging to a speech genre: while the genre of CallHome Spanish is personal telephone calls between family members, the actual instances of these calls contain activities such as storytelling, advising, interrogation and so forth. We present experimental work on the detection of those activities using a variety of features. We have also observed that a limited number of distinguished activities can be defined that describe most of the activities in this database in a precise way. | |
Building a Treebank for French | Very few gold standard annotated corpora are currently available for French. We present an ongoing project to build a reference treebank for French starting with a tagged newspaper corpus of 1 Million words (Abeille et al., 1998), (Abeille and Clement, 1999). Similarly to the Penn TreeBank (Marcus et al., 1993), we distinguish an automatic parsing phase followed by a second phase of systematic manual validation and correction. Similarly to the Prague treebank (Hajicova et al., 1998), we rely on several types of morphosyntactic and syntactic annotations for which we define extensive guidelines. Our goal is to provide a theory neutral, surface oriented, error free treebank for French. Similarly to the Negra project (Brants et al., 1999), we annotate both constituents and functional relations. | |
Establishing the Upper Bound and Inter-judge Agreement of a Verb Classification Task | Detailed knowledge about verbs is critical in many NLP and IR tasks, yet manual determination of such knowledge for large numbers of verbs is difficult, time-consuming and resource intensive. Recent responses to this problem have attempted to classify verbs automatically, as a first step to automatically build lexical resources. In order to estimate the upper bound of a verb classification task, which appears to be difficult and subject to variability among experts, we investigated the performance of human experts in controlled classification experiments. We report here the results of two experiments—using a forced-choice task and a non-forced choice task—which measure human expert accuracy (compared to a gold standard) in classifying verbs into three pre-defined classes, as well as inter-expert agreement. To preview, we find that the highest expert accuracy is 86.5% agreement with the gold standard, and that inter-expert agreement is not very high (K between .53 and .66). The two experiments show comparable results. | |
Layout Annotation in a Corpus of Patient Information Leaflets | We discuss the problems and issues that arose during the development of a procedure for annotating layout in a corpus of Patient Information Leaflets. We show how the genre of the corpus as well as the aim of the annotation influenced the annotation scheme. We also describe the automatic annotation procedure. | |
A New Methodology for Speech Corpora Definition from Internet Documents | In this paper, a new methodology for speech corpora definition from internet documents is described, in order to record a large speech database dedicated to the training and testing of acoustic models for speech recognition. In the first section, the Web robot in charge of collecting Web pages from the Internet is presented; then the mechanism for filtering web text into French sentences is explained. Some information about the corpus organization (90% for training and 10% for testing) is given. In the third section, the phoneme distribution of the corpus is presented and compared with other studies of the French language. Finally, the tools and planning for recording the speech database with more than one hundred speakers are described. | |
Coping with Lexical Gaps when Building Aligned Multilingual Wordnets | In this paper we present a methodology for automatically classifying the translation equivalents of a machine readable bilingual dictionary in three main groups: lexical units, lexical gaps (that is cases when a lexical concept of a language does not have a correspondent in the other language) and translation equivalents that need to be manually classified as lexical units or lexical gaps. This preventive classification reduces the manual work necessary to cope with lexical gaps in the construction of aligned multilingual wordnets. | |
Design and Construction of Knowledge base for Verb using MRD and Tagged Corpus | This paper presents the procedure for building a syntactic knowledge base. The study constructs basic sentence patterns automatically, using the POS-tagged portion of the balanced KAIST corpus and an electronic dictionary for Korean, and builds a syntactic knowledge base with specific information added to the lexicographer's analysis. The work process can be summarized as follows: 1) extraction of characteristic verbs, targeting high-frequency verbs in the KAIST corpus; 2) construction of sentence patterns from the case frame structure of each verb extracted from the MRD; 3) determination of the noun categories of each sentence pattern using KCP examples; 4) semantic classification of the selected verbs according to the classified sentence patterns; 5) assignment of hypernym concepts to individual noun categories; 6) addition of Japanese translations for each noun and verb. | |
Introduction of KIBS (Korean Information Base System) Project | This project has been carried out to build resources and tools for Korean NLP. The main work is the construction of a raw corpus of 64 million tokens and a part-of-speech tagged corpus of about 11 million tokens. We have also developed analytic tools to construct these corpora and supporting tools to navigate them. This paper presents the current state of the work carried out by the KIBS project. We introduce the KAIST tagset for POS and syntax used for the standard corpus, together with the annotation principles, and we discuss several error types found in the tagged corpus. | |
Resources for Multilingual Text Generation in Three Slavic Languages | The paper discusses the methods followed to re-use a large-scale, broad-coverage English grammar for constructing similar scale grammars for Bulgarian, Czech and Russian for the fast prototyping of a multilingual generation system. We present (1) the theoretical and methodological basis for resource sharing across languages, (2) the use of a corpus-based contrastive register analysis, in particular, contrastive analysis of mood and agency. Because the study concerns reuse of the grammar of a language that is typologically quite different from the languages treated, the issues addressed in this paper appear relevant to a wider range of researchers in need of large-scale grammars for less-researched languages. | |
A Multi-view Hyperlexicon Resource for Speech and Language System Development | New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe the specification, three-level lexical document design principles, and implementation of a MARTIF document structure and several presentation structures for a terminological lexicon, including both on-demand access and full hypertext lexicon compilation. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped onto the hypergraph structure which defines the macrostructure of the hyperlexicon. | |
Enabling Resource Sharing in Language Generation: an Abstract Reference Architecture | The RAGS project aims to develop a reference architecture for natural language generation, to facilitate the modular development of NLG systems as well as the evaluation of components, systems and algorithms. This paper gives an overview of the proposed framework, describing an abstract data model with five levels of representation: Conceptual, Semantic, Rhetorical, Document and Syntactic. We report on a re-implementation of an existing system using the RAGS data model. | |
Issues in Design and Collection of Large Telephone Speech Corpus for Slovenian Language | In this paper, different issues in the design, collection and evaluation of a large-vocabulary telephone speech corpus of the Slovenian language are discussed. The database is composed of three text corpora containing 1530 different sentences. It contains read speech of 82 speakers, each of whom read on average more than 200 sentences, and 21 speakers also read a text passage of 90 sentences. Initial manual segmentation and labeling of the speech material was performed; based on this, automatic segmentation was carried out. The database should facilitate the development of speech recognition systems to be used in dictation tasks over the telephone. Until now the database has been used mostly for isolated digit recognition tasks and word spotting. | |
ARC A3: A Method for Evaluating Term Extracting Tools and/or Semantic Relations between Terms from Corpora | This paper describes an ongoing project evaluating Natural Language Processing (NLP) systems. The aim of this project is to test software capabilities in the automatic or semi-automatic extraction of terminology from French corpora in order to build tools used in NLP applications. We are putting forward a strategy based on qualitative evaluation. The idea is to submit the results to specialists (i.e. field specialists, terminologists and/or knowledge engineers). The research we are conducting is sponsored by the ''Association des Universites Francophones'' (AUF), an international organisation whose mission is to promote the dissemination of French as a scientific medium. The software submitted to this evaluation was developed by French, Canadian and US research institutions (National Scientific Research Centre and Universities) and/or companies: CNRS (France), XEROX, and LOGOS Corporation, among others. | |
A Parallel English-Japanese Query Collection for the Evaluation of On-Line Help Systems | An experiment concerning the creation of parallel evaluation data for information retrieval is presented. A set of English queries was gathered for the domain of word processing using Lotus Ami Pro. A set of Japanese queries was then created from these. The answers to the queries were elicited from eight respondents comprising four native speakers of each language. We first describe how the queries were created and the answers elicited. We then present analyses of the responses in each language. The results show a lower level of agreement between respondents than was expected. We discuss a refinement of the elicitation process which is designed to address this problem as well as measuring the integrity of individual respondents. | |
Principled Hidden Tagset Design for Tiered Tagging of Hungarian | For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or more distinct codes. For the automatic disambiguation of arbitrary written texts, such large tagsets raise many problems, from the implementation issues of a tagger working with such a large tagset to the more theory-based difficulty of sparse training data. Tiered tagging is one way to alleviate this problem, by reformulating it as follows: starting from a large set of MSDs, design a reduced tagset (C-tagset) that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the cardinality of the MSD set is several thousand. This means that designing a manageable C-tagset calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features. | |
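A minimal sketch of the two-tier idea described above follows: tag with a coarse C-tag, then recover the full MSD from the word form via a lexicon. The MSD codes, word forms and mappings are invented for illustration and are not the Hungarian tagset itself.

```python
# Illustrative MSD -> coarse C-tag mapping; real tiered tagging derives this
# mapping by dropping features that are recoverable from the word form/lexicon.
MSD_TO_CTAG = {
    "Nc-sn": "NOUN-NOM",   # noun, common, singular, nominative (codes invented)
    "Nc-sa": "NOUN-ACC",   # noun, common, singular, accusative
    "Nc-pn": "NOUN-NOM",
    "Nc-pa": "NOUN-ACC",
}

# Lexicon mapping (word form, C-tag) back to the full MSD(s) it can realise.
LEXICON = {
    ("haz", "NOUN-NOM"): ["Nc-sn"],
    ("hazat", "NOUN-ACC"): ["Nc-sa"],
}

def to_ctag(msd):
    """First tier: collapse a full MSD to the reduced tag used by the tagger."""
    return MSD_TO_CTAG[msd]

def recover_msd(word, ctag):
    """Second tier: map the coarse tag back to full MSD(s) using the lexicon.
    If more than one MSD survives, a local disambiguator would still be needed."""
    return LEXICON.get((word, ctag), [])

print(to_ctag("Nc-sa"), recover_msd("hazat", "NOUN-ACC"))
```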
Evaluating Wordnets in Cross-language Information Retrieval: the ITEM Search Engine | This paper presents the ITEM multilingual search engine. This search engine performs full lexical processing (morphological analysis, tagging and Word Sense Disambiguation) on documents and queries in order to provide language-neutral indexes for querying and retrieval. The indexing terms are the EuroWordNet/ITEM InterLingual Index records that link wordnets in 10 languages of the European Community (the search engine currently supports Spanish, English and Catalan). The goal of this application is to provide a way of comparing in context the behavior of different Natural Language Processing strategies for Cross-Language Information Retrieval (CLIR) and, in particular, different Word Sense Disambiguation strategies for query translation and conceptual indexing. | |
An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages | We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, with a speed increase factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out of vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. A given spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) for a given stem lexicon size than with less highly inflecting languages (e.g. English) because of the `morphological handicap' (ratio of stems to inflected word forms), which for German is about 1:5. However, the problem is worse for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be `projected' using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need for defining pronunciation variants for inflected word forms; we also propose an efficient solution to this problem. | |
Sublanguage Dependent Evaluation: Toward Predicting NLP performances | In Natural Language Processing (NLP) evaluations such as MUC (Hirshman, 98), TREC (Harman, 98), GRACE (Adda et al., 97) and SENSEVAL (Kilgarriff, 98), the performance results provided are often averages computed over the complete test set. This gives no clue about a system's robustness: knowing which system performs better on average does not help us find which is best for a given subset of a language. In the present article, existing approaches that take language heterogeneity into account and offer methods to identify sublanguages are presented. We then propose a new metric to assess robustness and study the effect of different sublanguages identified in the Penn Treebank corpus on the performance variations observed for POS tagging. The work presented here is a first step in the development of predictive evaluation methods, intended to provide new tools for determining in advance the range of performance that can be expected from a system on a given dataset. | |
The Universal XML Organizer: UXO | The integrated editor UXO is the result of ongoing research and development of the text-technology group at Bielefeld. Being a full-featured XML-based editing system, it also allows the structured annotated data to be combined with information imported from relational databases through an integrated JDBC interface. The mapping processes between different levels of annotation can be programmed either with the integrated scheme interpreter or by extending the functionality of UXO using the predefined Java API. | |
TyPTex: Inductive Typological Text Classification by Multivariate Statistical Analysis for NLP Systems Tuning/Evaluation | The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or ''profiling'' within the ELRA benchmark called ''Contribution to the construction of contemporary French corpora'', based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model, allowing us, on the one hand, flexible annotation of the corpus with the output of NLP and statistical tools and, on the other hand, the ability to trace the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations. | |
An Approach to Lexical Development for Inflectional Languages | We describe a method for the semi-automatic development of morphological lexicons. The method aims at using minimal pre-existing resources and relies only upon the existence of a raw text corpus and a database of inflectional classes. No lexicon or list of base forms is assumed. The method is based on a contrastive approach, which generates hypothetical entries based on evidence drawn from a corpus, and selects the best candidates by heuristically comparing the candidate entries. The reliance upon inflectional information and the use of minimal resources make this approach particularly suitable for highly inflectional, lower-density languages. A prototype tool has been developed for Modern Greek. | |
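A minimal sketch of the contrastive hypothesis-generation step described above follows; the suffix paradigms, the stem-splitting scheme and the support threshold are assumptions for illustration, not the actual inflectional-class database or selection heuristics.

```python
# Toy inflectional classes: suffix paradigms keyed by class name (invented for
# illustration; real classes would come from the database of inflectional classes).
INFLECTION_CLASSES = {
    "verb_1": ["o", "eis", "ei"],    # e.g. 1sg / 2sg / 3sg endings
    "noun_1": ["os", "ou", "o"],     # e.g. nominative / genitive / accusative endings
}

def hypothesize_entries(corpus_forms, classes=INFLECTION_CLASSES):
    """For every candidate stem, count how many forms predicted by each
    inflectional class are attested in the corpus; the contrastive step then
    keeps the best-supported (stem, class) hypotheses."""
    corpus_forms = set(corpus_forms)
    hypotheses = set()
    for form in corpus_forms:
        for cls, suffixes in classes.items():
            for suf in suffixes:
                if not form.endswith(suf):
                    continue
                stem = form[: len(form) - len(suf)]
                attested = sum(stem + s in corpus_forms for s in suffixes)
                hypotheses.add((stem, cls, attested))
    # keep hypotheses supported by more than one attested form, best first
    return sorted((h for h in hypotheses if h[2] > 1), key=lambda h: -h[2])

print(hypothesize_entries(["grafo", "grafei", "grafeis", "logos", "logou"]))
```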
Some Language Resources and Tools for Computational Processing of Portuguese at INESC | In the last few years, automatic processing tools and corpus-based studies have become of great importance to the community. The possibility of evaluating and developing such tools and studies depends on the availability of language resources. For the Portuguese language, in its several national varieties, these resources are not sufficient to meet the community's needs. In this paper some valuable resources are presented, such as a multifunctional lexicon, general-purpose lexicons for European and Brazilian Portuguese, and corpus processing tools. | |
Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation | Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability into new domains as well as in the robustness over time. For the purpose of overcoming those limitations, this paper evaluates named entity chunking and classification techniques in Japanese named entity recognition in the context of minimally supervised learning. This experimental evaluation demonstrates that the minimally supervised learning method proposed here improved the performance of the seed knowledge on named entity chunking and classification. We also investigated the correlation between performance of the minimally supervised learning and the sizes of the training resources such as the seed set as well as the unlabeled training data. | |
Evaluation of a Generic Lexical Semantic Resource in Information Extraction | We have created an information extraction system that allows users to train the system on a domain of interest. The system helps to maximize the effect of user training by applying WordNet to rule generation and validation. The results show that, with careful control, WordNet is helpful in generating useful rules to cover more instances and hence improve the overall performance. This is particularly true when the training set is small, where F-measure is increased from 65% to 72%. However, the impact of WordNet diminishes as the size of training data increases. This paper describes our experience in applying WordNet to this system and gives an evaluation of such an effort. | |
The Establishment of Motorola's Human Language Data Resource Center: Addressing the Criticality of Language Resources in the Industrial Setting | Within the human language technology (HLT) field it is widely understood that the availability (and effective utilization) of voluminous, high quality language resources is both a critical need and a critical bottleneck in the advancement and deployment of cutting edge HLT applications. Recently formed (inter-)national human language resource (HLR) consortia (e.g., LDC, ELRA, ...) have made great strides in addressing this challenge by distributing a rich array of pre-competitive HLRs. However, HLT application commercialization will continue to demand that HLRs specific to target products (and complementary to consortially available resources) be created. In recognition of the general criticality of HLRs, Motorola has recently formed the Human Language Data Resource Center (HLDRC) to streamline and leverage our HLR creation and utilization efforts. In this paper, we use the specific case of the Motorola HLDRC to help examine the goals and range of activities which fall into the purview of a company-internal HLR organization, look at ways in which such an organization differs from (and is similar to) HLR consortia, and explore some issues with respect to implementation of a wholly within-company HLR organization like the HLDRC. | |
IPA Japanese Dictation Free Software Project | Large vocabulary continuous speech recognition (LVCSR) is an important basis for the application development of speech recognition technology. We have constructed a Japanese common LVCSR speech database and have been developing sharable Japanese LVCSR programs/models through volunteer-based efforts. We have been engaged in the following two volunteer-based activities: a) the IPSJ (Information Processing Society of Japan) LVCSR speech database working group; b) the IPA (Information Technology Promotion Agency) Japanese dictation free software project. The IPA Japanese dictation free software project (April 1997 to March 2000) aims at building Japanese LVCSR free software/models based on the IPSJ LVCSR speech database (JNAS) and the Mainichi newspaper article text corpus. The software repository produced by the IPA project is available to the public. More than 500 CD-ROMs have been distributed. The performance evaluation was carried out for the simple version, the fast version, and the accurate version in February 2000. The evaluation uses 200 sentence utterances from 46 speakers. Gender-independent HMM models and 20k/60k language models are used for evaluation. The accurate version, with 2000 HMM states and 16 Gaussian mixtures, shows a 95.9% word correct rate. The fast version, with the phonetic tied-mixture HMM and the 1/10 reduced language model, shows a 92.2% word correct rate at real-time speed. The CD-ROM with the IPA Japanese dictation free software and its development workbench will be distributed upon registration at http://www.lang.astem.or.jp/dictation-tk/ or by sending e-mail to dictation-tk-request@astem.or.jp. | |
Spontaneous Speech Corpus of Japanese | Design issues of a spontaneous speech corpus are described. The corpus under compilation will contain 800-1,000 hours of spontaneously uttered Common Japanese speech and the morphologically annotated transcriptions. Also, segmental and intonation labeling will be provided for a subset of the corpus. The primary application domain of the corpus is speech recognition of spontaneous speech, but we plan to make it useful for natural language processing and phonetic/linguistic studies as well. | |
Annotating Resources for Information Extraction | Trained systems for NE extraction have shown significant promise because of their robustness to errorful input and rapid adaptability. However, these learning algorithms have transferred the cost of development from skilled computational linguistic expertise to data annotation, putting a new premium on effective ways to produce high-quality annotated resources at minimal cost. The paper reflects on BBN's four years of experience in the annotation of training data for Named Entity (NE) extraction systems, discussing useful techniques for maximizing data quality and quantity. | |
The New Edition of the Natural Language Software Registry (an Initiative of ACL hosted at DFKI) | In this paper we present the new version (4th edition) of the Natural Language Software Registry (NLSR), an initiative of the Association for Computational Linguistics (ACL) hosted at DFKI in Saarbrücken. We give a brief overview of the history of this repository for Natural Language Processing (NLP) software, list some related works and go into the details of the design and the implementation of the new edition. | |
Design Methodology for Bilingual Pronunciation Dictionary | This paper presents the design methodology for the bilingual pronunciation dictionary of sound reference usage, which reflects the cross-linguistic, dialectal, first language (L1) interfered, biological and allophonic variations. The design methodology features 1) the comprehensive coverage of allophonic variation, 2) concise data entry composed of a balanced distribution of dialects, genders, and ages of speakers, 3) bilingual data coverage including L1-interfered speech, and 4) eurhythmic arrangements of the recording material for temporal regularity. The recording consists of the triple way comparison of 1) English sounds spoken by native English speakers, 2) Korean sounds spoken by native Korean speakers, and 3) English sounds spoken by Korean speakers. This paper also presents 1) the quality controls and 2) the structure and format of the data. The intended usage of this “sound-based” bilingual dictionary aims at 1) cross-linguistic and acoustic research, 2) application to speech recognition, synthesis and translation, and 3) foreign language learning including exercises. | |
LEXIPLOIGISSI: An Educational Platform for the Teaching of Terminology in Greece | This paper introduces a project, LEXIPLOIGISSI, which involves use of language resources for educational purposes. More particularly, the aim of the project is to develop written corpora, electronic dictionaries and exercises to enhance students' reading and writing abilities in six different school subjects. It is the product of a small-scale pilot program that will be part of the school curriculum in the three grades of Upper Secondary Education in Greece. The application seeks to create exploratory learning environments in which digital sound, image, text and video are fully integrated through the educational platform and placed under the direct control of users, who are able to follow individual pathways through data stores. | |
An HPSG-Annotated Test Suite for Polish | The paper presents both conceptual and technical issues related to the construction of an HPSG test-suite for Polish. The test-suite consists of sentences of written Polish — both grammatical and ungrammatical. Each sentence is annotated with a list of linguistic phenomena it illustrates. Additionally, grammatical sentences are encoded in HPSG-style AVM structures. We describe also a technical organization of the database, as well as possible operations on it. | |
The COST 249 SpeechDat Multilingual Reference Recogniser | The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.95 of the reference recogniser and presents results on small and medium vocabulary recognition for five languages. | |
Terminology Encoding in View of Multifunctional NLP Resources | Given the existing standards for organising terminology resources, the main question raised is how to create a DB or assimilated term list with properties allowing for an efficient NLP treatment of input texts. Here, we have dealt with the output of MT and have attempted to improve terminological annotation of the input text, in order to optimize reusability and efficiency of performance. By organizing terms in DB-like tables, which provide various cross-linked indications about head properties, morpho-syntax, derivational morphology and semantic-pragmatic relations between concepts of terms, we have managed to improve the functionality of resources and enable better customisation. Moreover, we have tried to view the proposed term DB organisation as part of a global account of the problem of terminology resolution during processing, via grammar-based or user-machine interaction techniques for term recognition and disambiguation, since term boundary definition is generally recognised to be a complex and costly enterprise, directly related to the fact that most problem-causing terminology items are multi-word units, characterized either as fixed or as ad hoc, not-yet-fixed terms. | |
Terminology in Korea: KORTERM | KORTERM (Korea Terminology Research Center for Language and Knowledge Engineering) was set up in late August 1998 under the auspices of the Ministry of Culture and Tourism in Korea. Its major mission is to construct terminology resources and to pursue their unification, harmonization and standardization. This mission is naturally linked to general language engineering and knowledge engineering tasks, including specific-domain corpus, ontology, wordnet and electronic dictionary construction, as well as language engineering products like information extraction and machine translation. The organization is hosted by KAIST (Korea Advanced Institute of Science and Technology), a national university under the Ministry of Science and Technology. KORTERM is the sole representative for terminology standardization and research in relation to Infoterm. | |
Morphological Tagging to Resolve Morphological Ambiguities | The aim of this paper is to present the advantages of a morphological tagging of English for resolving morphological ambiguities. Such a way of tagging seems more efficient because it allows an intensional description of morphological forms, compared with the extensive collections of usual dictionaries. This method has already been tried on French and has given promising results. It is very relevant since it both brings to light hidden morphological rules, which are very useful especially for foreign learners, and takes lexical creativity into account. Moreover, this morphological tagging was conceived in relation to the subsequent disambiguation, which is mainly based on local grammars. The purpose is to create a morphological analyser that is easily adaptable and modifiable and that avoids the usual errors of ordinary morphological taggers tied to dictionaries. | |
An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research | In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface. | |
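The criteria above are defined relative to plain edit distance; purely as a point of reference, here is a minimal Python sketch (not taken from the tool described) of the word-level edit distance and the derived word error rate that such evaluation typically starts from. The example sentences are invented.

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

# Word error rate: edit distance normalised by reference length.
ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
wer = word_edit_distance(ref, hyp) / len(ref.split())
print(f"WER = {wer:.2f}")
```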
GeDeriF: Automatic Generation and Analysis of Morphologically Constructed Lexical Resources | One of the most frequent problems in text retrieval comes from the large number of words encountered which are not listed in general language dictionaries. However, it is very often the case that these words are morphologically complex, and as such have a meaning which is predictable on the basis of their structure. Furthermore, such words typically belong to specialized language uses (e.g. scientific, philosophical or media technolects). Consequently, tools for listing and analysing such words can help enrich a terminological database. The purpose of this paper is to present a system that automatically generates morphologically complex French lexical items which are not listed in dictionaries, and that furthermore provides a structural and semantic analysis of these items. The output of this system is a morphological database (currently in progress) which forms a powerful lexical resource. It will be very useful in Natural Language Processing (NLP) and Information Retrieval (IR) applications. Indeed the system generates a potentially infinite set of complex (derived) lexical units (henceforth CLUs) automatically associated with a rich array of morpho-semantic features, and is thus capable of dealing with morphologically complex structures which are unlisted in dictionaries. | |
The Compalex Programme (COMPAraison LEXicale, i.e. lexical comparison) | ||
Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies | This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully. | |
Accessibility of Multilingual Terminological Resources - Current Problems and Prospects for the Future | In this paper we analyse the various problems in making multilingual terminological resources available to users. Different levels of diversity and incongruence among such resources are discussed. Previous standardization efforts are reviewed. As a solution to the lack of co-ordination and compatibility among an increasing number of ‘standard’ interchange formats, a higher level of integration is proposed for the purpose of terminology-enabled knowledge sharing. The family of formats currently being developed in the SALT project is presented as a contribution to this solution. | |
Using a Formal Approach to Evaluate Grammars | In this paper, we present a methodological formal approach to evaluate grammars based on a unified representation. This approach uses two kinds of criteria. The first one considers a grammar as a resource enabling the representation of particular aspects of a given language. The second is interested in using grammars in the development of lingware. The evaluation criteria are defined in a formal way. In addition, we indicate for every criterion how it would be applied. | |
Design Issues in Text-Independent Speaker Recognition Evaluation | We discuss various considerations that have been involved in designing the past five annual NIST speaker recognition evaluations. These text-independent evaluations using conversational telephone speech have attracted state-of-the-art automatic systems from research sites around the world. The availability of appropriate data for sufficiently large test sets has been one key design consideration. There have also been variations in the specific task definitions, the amount and type of training data provided, and the durations of the test segments. The microphone types of the handsets used, as well as the match or mismatch of training and test handsets, have been found to be important considerations that greatly affect system performance. | |
Developing Guidelines and Ensuring Consistency for Chinese Text Annotation | With growing interest in Chinese Language Processing, numerous NLP tools (e.g. word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a 100-thousand-word bracketed corpus since late 1998 and plan to release it to the public in the summer of 2000. In this paper, we will address several challenges in building the corpus, namely, creating annotation guidelines, ensuring annotation accuracy and maintaining a high level of community involvement. | |
Corpora of Slovene Spoken Language for Multi-lingual Applications | The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi-modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools (EAGLES Handbook 1997). This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics. | |
GRUHD: A Greek database of Unconstrained Handwriting | In this paper we present the GRUHD database of Greek characters, text, digits, and other symbols in unconstrained handwriting mode. The database consists of 1,760 forms that contain 667,583 handwritten symbols and 102,692 words in total, written by 1,000 writers, 500 men and an equal number of women. Special attention was paid to gathering data from writers of different ages and educational levels. The GRUHD database is accompanied by the GRUHD software that facilitates its installation and use and enables the user to extract and process the data from the forms selectively, depending on the application. The various types of possible installations make it appropriate for the training and validation of character recognition, character segmentation and text-dependent writer identification systems. | |
Labeling of Prosodic Events in Slovenian Speech Database GOPOLIS | The paper describes the prosodic annotation procedures of the GOPOLIS Slovenian speech database and methods for automatic classification of different prosodic events. Several statistical parameters concerning duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach. The obtained knowledge on Slovenian prosody can be used in Slovenian speech recognition and understanding for automatic prosodic event determination and in Slovenian speech synthesis for prosody prediction. | |
NL-Translex: Machine Translation for Dutch | NL-Translex is an MLIS project which is funded jointly by the European Commission, the Dutch Language Union, the Dutch Ministry of Education, Culture and Science, the Dutch Ministry of Economic Affairs and the Flemish Institute for the Promotion of Scientific and Technological Research in Industry. The aim of this project is to develop Machine Translation components that will handle unrestricted text and translate Dutch from and into English, French and German. In addition to this practical aim, the partners in this project all have objectives relating to strategy, language policy and culture. The modules to be developed are intended primarily for use by EU institutions and the translation services of official bodies in the Member States. In this paper we describe in detail the aims and structure of the project, the user population, the available resources and the activities carried out so far, in particular the procedure followed for the call for tenders aimed at selecting a technology provider. Finally, we describe the acceptance procedure, the strategic impact of the project and the dissemination plan. | |
Rarity of Words in a Language and in a Corpus | A simple method presented last year (Hlavacova & Rychly, 1999) allows one to distinguish automatically between rare and common words having the same frequency in a language corpus. The method operates with two new terms: reduced frequency and rarity. Rarity was proposed as a measure of word rareness or commonness in a language. This article deals with rarity in more depth. Its value was calculated for several different corpora and compared. Two experiments were done on real data taken from the Czech National Corpus. Results of the first one show that reordering of the texts in the corpus does not influence the rarity of words with a high frequency in the corpus. In the second experiment, the rarity of the same words in two corpora of different sizes is compared. | |
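For readers unfamiliar with the idea, the sketch below illustrates one plausible, simplified reading of "reduced frequency" and "rarity" (the corpus split into equal segments, counting how many segments contain the word); the actual definitions are those of Hlavacova & Rychly (1999), and the function names and the normalisation used here are assumptions made purely for illustration.

```python
from collections import Counter

def reduced_frequency(corpus_tokens, word, n_segments=100):
    """Illustrative 'reduced frequency': the number of equal-sized corpus
    segments in which the word occurs at least once (hypothetical variant;
    see Hlavacova & Rychly, 1999, for the actual definition)."""
    seg_len = max(1, len(corpus_tokens) // n_segments)
    segments = [corpus_tokens[i:i + seg_len]
                for i in range(0, len(corpus_tokens), seg_len)]
    return sum(1 for seg in segments if word in seg)

def rarity(corpus_tokens, word, n_segments=100):
    """Illustrative 'rarity': among words of equal raw frequency, a word is
    rarer the more its occurrences are concentrated in few segments."""
    raw = Counter(corpus_tokens)[word]
    if raw == 0:
        return None
    return 1.0 - reduced_frequency(corpus_tokens, word, n_segments) / min(raw, n_segments)
```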
Language Resources Development at the Spanish Royal Academy | This paper explains some of the most relevant issues concerning the development of language resources at the Spanish Royal Academy. Two 125-million-word corpora of the Spanish language (synchronic and diachronic) and three specialized corpora have been developed. Around these corpora, the RAE is also developing NLP tools and resources to annotate them morpho-syntactically. Some of the most relevant are: the computational lexicon, the morphological analysis tools, the disambiguation grammars and the tokenizer generator. The last section describes the lexicographic use of corpus materials and includes a brief description of the corpus-based lexicographical workbench and its related tools. | |
Reusability as Easy Adaptability: A Substantial Advance in NL Technology | The design and implementation of new applications in NLP at low cost mostly depends upon the availability of technologies oriented to the solution of any specific problem. The success of this task, besides the use of widely agreed formats and standards, relies upon at least two families of tools: those for managing and updating, and those for projecting an ''application view-point'' onto the data in the repository. This approach has different realizations if applied to a dictionary, a corpus, or a grammar. Some examples, taken from European and other industrial projects, show that reusability: a) in the building of industrial prototypes, consists in the easy reconfiguration of resources (dictionary and grammar), easy portability and easy recombination of tools, by means of simple APIs, as well as on different implementation platforms; b) in the building of advanced applications, still consists in the same features, together with the possibility of opening different view-points on dictionaries and grammars. | |
Looking for Errors: A Declarative Formalism for Resource-adaptive Language Checking | The paper describes a phenomenon-based approach to grammar checking, which draws on the integration of different shallow NLP technologies, including morphological and POS taggers, as well as probabilistic and rule-based partial parsers. We present a declarative specification formalism for grammar checking and controlled language applications which greatly facilitates the development of checking components. | |
The Bank of Swedish | The Bank of Swedish is described: affiliation, organisation, linguistic resources and tools. A point is made of the close connection between lexical research and corpus data, the broad textual coverage from Modern Swedish to Old Swedish, the official status of the organisation and its connection to Göteborg University. The relation to the broader scope of the comprehensive Language Database of Swedish is discussed. A few current issues of the Bank of Swedish are presented: parallel corpora production, the construction of a Swedish morphology database and sense tagging of text corpora. Finally, the updating of the Bank of Swedish concordance system is mentioned. | |
Automatic Style Categorisation of Corpora in the Greek Language | In this article, a system is proposed for the automatic style categorisation of text corpora in the Greek language. This categorisation is based to a large extent on the type of language used in the text, for example whether the language used is representative of formal Greek or not. To arrive at this categorisation, the highly inflectional nature of the Greek language is exploited. For each text, a vector of both structural and morphological characteristics is assembled. Categorisation is achieved by comparing this vector to given archetypes using a statistics-based method. Experimental results are also reported. | |
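A minimal sketch of the kind of vector-versus-archetype comparison described, assuming a toy two-dimensional feature vector (average sentence length and a suffix-based formality rate) and hypothetical archetype centroids; the paper's actual feature set and statistical method are richer than this.

```python
import math

def feature_vector(text, formal_suffixes=("ις", "εως", "ούν")):
    """Toy structural/morphological profile of a text: average sentence length
    and the relative frequency of word endings taken here, purely for
    illustration, as markers of more formal Greek registers."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sent_len = len(words) / max(1, len(sentences))
    suffix_rate = sum(w.endswith(formal_suffixes) for w in words) / max(1, len(words))
    return [avg_sent_len, suffix_rate]

def nearest_archetype(vec, archetypes):
    """Assign the text to the archetype with the smallest Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(archetypes, key=lambda name: dist(vec, archetypes[name]))

archetypes = {"formal": [28.0, 0.05], "informal": [14.0, 0.005]}  # hypothetical centroids
print(nearest_archetype(feature_vector("Το κείμενο αυτό είναι σύντομο."), archetypes))
```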
Automatic Extraction of Semantic Similarity of Words from Raw Technical Texts | In this paper we address the problem of extracting semantic similarity relations between lexical entities based on context similarities as they appear in specialized text corpora. Only general-purpose linguistic tools are utilized in order to achieve portability across domains and languages. Lexical context is extended beyond immediate adjacency but is still confined by clause boundaries. Morphological and collocational information is employed in order to make the most of the contextual data. The extracted semantic similarity relations are transformed into semantic clusters, which constitute a primal form of a domain-specific term thesaurus. | |
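The following is a small illustration, under simplifying assumptions, of context-vector construction confined to clause boundaries and of a cosine measure over those vectors; clause boundaries are crudely approximated by punctuation, and the clustering step that builds the thesaurus is not shown.

```python
import re
from collections import defaultdict, Counter
from math import sqrt

def context_vectors(text, window=3):
    """Co-occurrence vectors collected within clause boundaries (approximated
    here by punctuation), extending context beyond immediate adjacency but
    never across clauses."""
    vectors = defaultdict(Counter)
    for clause in re.split(r"[.,;:!?]", text.lower()):
        tokens = clause.split()
        for i, w in enumerate(tokens):
            for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
                vectors[w][c] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = context_vectors("the engine drives the pump, the motor drives the fan.")
print(cosine(vecs["engine"], vecs["motor"]))
```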
Predictive Performance of Dialog Systems | This paper relates some of our experiments on the possibility of predictive performance measures for dialog systems. Evaluating dialog systems is often a very costly procedure due to the necessity of carrying out user trials. Obviously it is advantageous when evaluation can be carried out automatically. It would be helpful if, for each application, we were able to measure the system performance by an objective cost function. This performance function can be used for making predictions about the future evolution of a system without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (a kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI with a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and the number of user repetitions. Furthermore, similar spoken dialog features appear as important features for the Arise system (a train timetable telephone information system). We also explore different ways of measuring user satisfaction. We then discuss the introduction of subjective factors in the predictive coefficients. | |
Automatic Generation of Dictionary Definitions from a Computational Lexicon | This paper presents an automatic generator of dictionary definitions for concrete entities, based on information extracted from a Computational Lexicon (CL) containing semantic information. The aim of the adopted approach, combining NLG techniques with the exploitation of the formalised and systematic lexical information stored in the CL, is to produce well-formed dictionary definitions free from the shortcomings of traditional dictionaries. The architecture of the system is presented, focusing on the adaptation of the NLG techniques to the specific application requirements, and on the interface between the CL and the generator. Emphasis is placed on the appropriateness of the CL for the application purposes. | |
Regional Pronunciation Variants for Automatic Segmentation | The goal of this paper is to create an extended rule corpus of approximately 2,300 phonetic rules which model the segmental variation of regional variants of German. The phonetic rules express, at a broad-phonetic level, phenomena of phonetic reduction in German that occur within words and across word boundaries. In order to obtain an improvement in the automatic segmentation of regional speech variants, these rules are clustered and implemented, depending on regional specification, in the Munich Automatic Segmentation System. | |
SegWin: a Tool for Segmenting, Annotating, and Controlling the Creation of a Database of Spoken Italian Varieties | A number of actions have recently been proposed, aiming at filling the existing gap in the availability of annotated speech corpora of Italian regional varieties. A starting action is represented by the national project AVIP (Archivio delle Varietà di Italiano Parlato, Spoken Italian Varieties Archive), whose main challenge is a methodological one, namely finding annotation strategies and developing suitable software tools for coping with the inadequacy of linguistic models for Italian accent variations. Basically, these strategies consist in adopting an iterative process of labelling such that a description of each variety can be achieved by successive refinement stages without losing intermediate-stage information. To satisfy such requirements, a specific software system, called SegWin, has been developed by Politecnico di Bari, which: • “guides” the human transcribers in the annotation phases by a sort of “scheduled procedure”; • allows incremental addition of information at any stage of the database creation; • monitors/checks the consistency of the database during every stage of its creation. The system has been extensively used by all the partners of the AVIP project and is continuously updated to take into account the project's needs. The main characteristics of SegWin are described here, in relation to the above mentioned aspects. | |
Automotive Speech-Recognition - Success Conditions Beyond Recognition Rates | From a car manufacturer's point of view it is very important to integrate evaluation procedures into the MMI development process. When focusing on the usability evaluation of speech-input and speech-output systems, aspects beyond recognition rates must be addressed. Two of these conditions will be discussed, based upon user studies conducted in 1999: • Mental workload and distraction • Learnability | |
The ISLE Corpus of Non-Native Spoken English | For the purpose of developing pronunciation training tools for second language learning a corpus of non-native speech data has been collected, which consists of almost 18 hours of annotated speech signals spoken by Italian and German learners of English. The corpus is based on 250 utterances selected from typical second language learning exercises. It has been annotated at the word and the phone level, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments. The data has been used to develop and evaluate several diagnostic components, which can be used to produce corrective feedback of unprecedented detail to a language learner. | |
A Graphical Parametric Language-Independent Tool for the Annotation of Speech Corpora | Robust speech recognizers and synthesizers require well-annotated corpora in order to be trained and tested, thus making speech annotation tools crucial in speech technology. It is very important that these tools are parametric so that they can handle various directory and file structures and deal with different waveform and transcription formats. They should also be language-independent, provide a user-friendly interface or even interact with other kinds of speech processing software. In this paper we describe an efficient tool able to cope with the above requirements. It was first developed for the annotation of the SpeechDat-II recordings, and then it was extended to incorporate the additional features of the SpeechDat-Car project. Nevertheless, it has been parameterized so that it is not restricted to the SpeechDat format and Greek, and it can handle any other formalism and language. | |
The PAROLE Program | The PAROLE project (Contract LE2-4017) was launched in May 1996 by the Commission of the European Communities, at the initiative of DG XIII (Telecommunications, Information Market and Exploitation of Research). PAROLE is not just a project for gathering and creating a corpus. We are creating a true architectural model whose richness and quality will constitute strategic assets for European linguistic studies. This two-level architecture will link together two major components, morphological and syntactic. | |
For a Repository of NLP Tools | In this paper, we assume that the perspective which consists of identifying the NLP supply according to its different uses gives a general and efficient framework to understand the existing technological and industrial offer in a user-oriented approach. The main feature of this approach is to analyse how a specific technical product is really used by the users and not only to highlight how the developers expect the product to be used. To achieve this goal with NLP products, we first need to have a clear and quasi-exhaustive picture of the technical and industrial supply. During the 1998-1999 period, the European Language Resources Association (ELRA) conducted a study funded by the French Ministry of Research and Higher Education to produce a directory of language engineering tools and resources for French. In this paper, we present the main results of the study. The first part gives some information on the methodology adopted to conduct the study, the second part presents the main characteristics of the classification and the third part gives an overview of the applications which have been identified. | |
Survey of Language Engineering Needs: a Language Resources Perspective | This paper describes the current state of an on-going survey that aims at determining the needs of users with respect to available and potentially available Language Resources (LRs). Following market monitoring strategies that have been outlined within the Language Resources - Packaging and Production project (LRsP&P LE4-8335), the main objective of this survey is to provide concrete figures for developing a more reliable and workable business plan for the European Language Resources Association (ELRA) and its Distribution Agency (ELDA), and to determine investment plans for sponsoring the production of new resources. | |
Interarbora and Thistle - Delivering Linguistic Structure by the Internet | I describe an Internet service ''Interarbora'', which facilitates the visualization of tree structures. The service is built on top of a general purpose editor ''Thistle'', which allows the editing of diagrams and the generation of print format representations. | |
Automatically Augmenting Terminological Lexicons from Untagged Text | Lexical resources play a crucial role in language technology but lexical acquisition can often be a time-consuming, laborious and costly exercise. In this paper, we describe a method for the automatic acquisition of technical terminology from domain restricted texts without the need for sophisticated natural language processing tools, such as taggers or parsers, or text corpora annotated with labelled cases. The method is based on the idea of using prior or seed knowledge in order to discover co-occurrence patterns for the terms in the texts. A bootstrapping algorithm has been developed that identifies patterns and new terms in an iterative manner. Experiments with scientific journal abstracts in the biology domain indicate an accuracy rate for the extracted terms ranging from 58% to 71%. The new terms have been found useful for improving the coverage of a system used for terminology identification tasks in the biology domain. | |
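As an illustration of the bootstrapping idea (seed terms, pattern discovery, new-term proposal, iteration), here is a deliberately simplified sketch in which a "pattern" is just the word preceding a known term; the paper's pattern model and filtering are more elaborate than this.

```python
def bootstrap_terms(sentences, seed_terms, iterations=5):
    """Toy bootstrapping loop: learn contexts of known terms, then use those
    contexts to propose new terms, and repeat until nothing new is found."""
    terms = set(seed_terms)
    for _ in range(iterations):
        # 1) Collect co-occurrence patterns: words immediately preceding a known term.
        patterns = set()
        for sent in sentences:
            toks = sent.split()
            for i, tok in enumerate(toks[1:], start=1):
                if tok in terms:
                    patterns.add(toks[i - 1])
        # 2) Propose new terms: words that follow a learned pattern word.
        new_terms = set()
        for sent in sentences:
            toks = sent.split()
            for i, tok in enumerate(toks[:-1]):
                if tok in patterns and toks[i + 1] not in terms:
                    new_terms.add(toks[i + 1])
        if not new_terms:
            break
        terms |= new_terms
    return terms

corpus = ["the kinase phosphorylates the substrate",
          "the enzyme binds the receptor"]
print(bootstrap_terms(corpus, {"kinase"}))
```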
Annotating Events and Temporal Information in Newswire Texts | If one is concerned with natural language processing applications such as information extraction (IE), which typically involve extracting information about temporally situated scenarios, the ability to accurately position key events in time is of great importance. To date only minimal work has been done in the IE community concerning the extraction of temporal information from text, and the importance, together with the difficulty of the task, suggest that a concerted effort be made to analyse how temporal information is actually conveyed in real texts. To this end we have devised an annotation scheme for annotating those features and relations in texts which enable us to determine the relative order and, if possible, the absolute time, of the events reported in them. Such a scheme could be used to construct an annotated corpus which would yield the benefits normally associated with the construction of such resources: a better understanding of the phenomena of concern, and a resource for the training and evaluation of adaptive algorithms to automatically identify features and relations of interest. We also describe a framework for evaluating the annotation and compute precision and recall for different responses. | |
Chinese-English Semantic Resource Construction | We describe an approach to the large-scale construction of a semantic lexicon for Chinese verbs. We leverage off of three existing resources: a classification of English verbs called EVCA (English Verb Classes and Alternations) (Levin, 1993), a Chinese conceptual database called HowNet (Zhendong, 1988c; Zhendong, 1988b; Zhendong, 1988a) (http://www.how-net.com), and a large machine-readable dictionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval. | |
Production of NLP-oriented Bilingual Language Resources from Human-oriented Dictionaries | In this paper, the main features of manually produced bilingual dictionaries, which were originally designed for human use, are considered. The problem is to find a way to use such dictionaries in order to produce bilingual language resources that could form a basis for automatic text processing, such as machine translation, cross-lingual interrogation in text retrieval, etc. The transformation technology suggested here is based on XML parsing of the file obtained from the source data by means of a series of special procedures. In order to produce a well-formed XML file, automatic procedures suffice. But in most cases, there are still semantic problems and inconveniences that can be removed only in an interactive way. However, the volume of this work can be minimized thanks to automatic pre-editing and suitable XML mark-up. The paper presents the results of an R&D project which was carried out in the framework of the ELRA 1999 Call for Proposals on Language Resources Production. The paper is based on the authors' experience with English-Russian and French-Russian dictionaries, but the technology can be applied to other pairs of languages. | |
Developing a Multilingual Telephone Based Information System in African Languages | This paper introduces the first project of its kind within the Southern African language engineering context. It focuses on the role of idiosyncratic linguistic and pragmatic features of the different languages concerned and how these features are to be accommodated within (a) the creation of applicable speech corpora and (b) the design of the system at large. An introduction to the multilingual realities of South Africa and its implications for the development of databases is followed by a description of the system and different options that may be implemented in the system. | |
Tuning Lexicons to New Operational Scenarios | In this paper the role of the lexicon within typical application tasks based on NLP is analysed. A large-scale semantic lexicon is studied within the framework of an NLP application. The coverage of the lexicon with respect to the target domain and a (semi-)automatic tuning approach have been evaluated. The impact of a corpus-driven inductive architecture aiming to compensate for gaps in lexical information is thus measured and discussed. | |
SpeechDat-Car Fixed Platform | SpeechDat-Car aims to develop a set of speech databases to support training and testing of multilingual speech recognition applications in the car environment. Two types of recordings compose the database. The first type consists of wideband audio signals recorded directly in the car, while the second type is composed of GSM signals transmitted from the car and recorded simultaneously at the far end. Therefore, two recording platforms were used: a ‘mobile’ recording platform installed inside the car and a ‘fixed’ recording platform located at the far-end fixed side of the GSM communications system. This paper describes the fixed-platform software developed by the Universitat Politecnica de Catalunya (ADA-K). This software is able to work with standard, inexpensive PC cards for ISDN lines. | |
Inter-annotator Agreement for a German Newspaper Corpus | This paper presents the results of an investigation on inter-annotator agreement for the NEGRA corpus, consisting of German newspaper texts. The corpus is syntactically annotated with part-of-speech and structural information. Agreement for part-of-speech is 98.6%, the labeled F-score for structures is 92.4%. The two annotations are used to create a common final version by discussing differences and by several iterations of cleaning. Initial and final versions are compared. We identify categories causing large numbers of differences and categories that are handled inconsistently. | |
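For concreteness, a minimal sketch of how a labeled F-score between two bracketings can be computed when each annotation is represented as a set of (label, start, end) constituent spans; the NEGRA comparison involves full parse trees and normalisation steps not shown here, and the example spans are invented.

```python
def labeled_f_score(annotation_a, annotation_b):
    """Labeled F-score between two bracketings of the same sentence, each given
    as a set of (label, start, end) spans; one is treated as reference, the
    other as candidate."""
    a, b = set(annotation_a), set(annotation_b)
    matched = len(a & b)
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ann1 = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
ann2 = {("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}
print(f"labeled F = {labeled_f_score(ann1, ann2):.3f}")
```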
Interactive Corpus Annotation | We present an easy-to-use graphical tool for syntactic corpus annotation. This tool, Annotate, interacts with a part-of-speech tagger and a parser running in the background. The parser incrementally suggests single phrases bottom-up based on cascaded Markov models. A human annotator confirms or rejects the parser’s suggestions. This semi-automatic process facilitates a very rapid and efficient annotation. | |
The Concede Model for Lexical Databases | The value of language resources is greatly enhanced if they share a common markup with an explicit minimal semantics. Achieving this goal for lexical databases is difficult, as large-scale resources can realistically only be obtained by up-translation from pre-existing dictionaries, each with its own proprietary structure. This paper describes the approach we have taken in the Concede project, which aims to develop compatible lexical databases for six Central and Eastern European languages. Starting with sample entries from original presentation-oriented electronic representations of dictionaries, we transformed the data into an intermediate TEI-compatible representation to provide a common baseline for evaluating and comparing the dictionaries. We then developed a more restrictive encoding, formalised as an XML DTD with a clearly-defined semantic interpretation. We present this DTD and discuss a sample conversion from TEI, together with an application which hyperlinks an HTML representation of the dictionary to on-line concordancing over a corpus. | |
Design and Implementation of the Online ILSP Greek Corpus | This paper presents the Hellenic National Corpus (HNC), which is the corpus of Modern Greek developed by the Institute for Language and Speech Processing (ILSP). The presentation describes all stages of the creation of the corpus: collection of the material, tagging and tokenizing, construction of the database and the online implementation which aims at rendering the corpus accessible over Internet to the research community. | |
A Software Toolkit for Sharing and Accessing Corpora Over the Internet | This paper describes the Translational English Corpus (TEC) and the software tools developed in order to enable the use of the corpus remotely, over the internet. The model underlying these tools is based on an extensible client-server architecture implemented in Java. We discuss the data and processing constraints which motivated the TEC architecture design and its impact on the efficiency and scalability of the system. We also suggest that the kind of distributed processing model adopted in TEC could play a role in fostering the availability of corpus linguistic resources to the research community. | |
Tools for the Generation of Morphological Entries in Dictionaries | The lexicographer's tool introduced in this report is a semiautomatic system to generate the section of morphological information for Estonian words in dictionary entries. Estonian is a language with a complicated morphology featuring (1) rich inflection and (2) marked and diverse morpheme variation, applying both to stems and formatives. The kernel of the system is a rule-based automatic morphology with separate program modules for every linguistic subsystem, such as syllabification, recognition of part of speech and type of inflection, stem variation, and morpheme and allomorph combinatorics. The modules function as rule interpreters applying formal grammars in an editable text format. The system enables generation of the following: (1) part of speech, (2) type of inflection, (3) inflected forms, (4) morphonological marking: degree of quantity, morpheme boundaries (stem+formative, component boundaries in compounds), (5) morphological references for inflected forms considerably different from the headword. The system is configurable, so that the inflected forms to be generated, the style of morphonological marking and the criteria for reference selection are all up to the user to choose. Full automation of the system application is restricted mainly by morphological homonymy. | |
Improving Lexical Databases with Collocational Information: Data from Portuguese | This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can be conceived through relations linking the lexical components. Mechanisms for dealing with the collocation-hood of these expressions need to be included in the systems, specifically in their lexical modules. Lexical databases like wordnets, with a general architecture typically structured on semantic relations, make room for the specification of this phenomenon. It can be handled through the definition of ad-hoc relations expressing the different semantic effects that adjectival modification brings to nominal phrases, collocationally. | |
Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model | In this paper, we propose a method for constructing a tree-annotated corpus when a certain statistical parsing system exists but no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and to iteratively retrain the statistical language model on the obtained annotated trees. The major characteristics of our method are as follows: (1) in the first step of the iterative learning process, we manually construct a tree-annotated corpus with which to initialize the statistical language model, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources to choose the most probable parse tree. | |
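A schematic rendering of the iterative learning loop described, with the parser and the language-model trainer abstracted as caller-supplied functions; the interfaces train(trees) and parse(model, sentence) are hypothetical placeholders, not the authors' actual system.

```python
def iterative_treebank_construction(sentences, seed_trees, train, parse, rounds=3):
    """Skeleton of the iterative scheme: start from a small manually built
    treebank, parse raw sentences with the current model, and retrain on the
    accumulated trees. `train(trees) -> model` and `parse(model, sentence) ->
    tree` are assumed, caller-supplied functions."""
    trees = list(seed_trees)
    model = train(trees)
    for _ in range(rounds):
        new_trees = [parse(model, s) for s in sentences]   # annotate plain text
        trees = list(seed_trees) + new_trees                # keep the manual seed
        model = train(trees)                                # retrain on all trees
    return model, trees
```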
Issues from Corpus Analysis that have influenced the On-going Development of Various Haitian Creole Text- and Speech-based NLP Systems and Applications | This paper describes issues that are relevant to using small- to large-sized corpora for the training and testing of various text- and speech-based natural language processing (NLP) systems for minority and vernacular languages. These R&D and commercial systems and applications include machine translation, orthography conversion, optical character recognition, speech recognition, and speech synthesis that have already been produced for the Haitian Creole (HC) language. Few corpora for minority and vernacular languages have been created specifically for language resource distribution and for NLP system training. As a result, some of the only available corpora are those that are produced within real end-user environments. It is therefore of utmost importance that written language standards be created and then observed so that research on various text- and speech-based systems can be fruitful. In doing so, this also provides vernacular and minority languages with the opportunity to have an impact within the globalization and advanced communication needs efforts of the modern day world. Such technologies can significantly influence the status of these languages, yet the lack of standardization is a severe impediment to technological development. A number of relevant issues are discussed in this paper. | |
NaniTrans: a Speech Labelling Tool | This paper provides a description of NaniTrans, a tool for the segmentation and labeling of speech. The tool is programmed to work on the MATLAB application interface, on any of the supported platforms (Unix, Windows, Macintosh). The tool has been designed to annotate large speech databases, which can also be partially preprocessed (but still require manual supervision). It supports the definition of an environment of annotation: set of annotation levels (orthographic, phonetic, etc.), display mode (how to show information), graphic representation (waveform, spectrogram), keyboard short-cuts, etc. This configuration is then used on a speech database. A safe file locking system allows many annotators to work concurrently on the same speech database. The tool is very user-friendly and easy to use by inexperienced annotators, and it is designed to optimize speed using both keyboard and mouse. New options or speech processing tools can be easily added by using any MATLAB or user-defined function. | |
Acquisition of Linguistic Patterns for Knowledge-based Information Extraction | In this paper we present a new method of automatic acquisition of linguistic patterns for Information Extraction, as implemented in the CICERO system. Our approach combines lexico-semantic information available from the WordNet database with collocating data extracted from training corpora. Due to the open-domain nature of the WordNet information and the immediate availability of large collections of texts, our method can be easily ported to open-domain Information Extraction. | |
A Platform for Dutch in Human Language Technologies | As ICT increasingly forms a part of our daily life, it becomes more and more important that all citizens can make use of their native languages in all communicative situations. For the development of successful applications and products for Dutch, basic provisions are required. The development of the basic material that is lacking is an expensive undertaking which exceeds the capacity of the individual parties involved. Collaboration between the various agents (policy, knowledge infrastructure and industry) in the Netherlands and Flanders is required. The existence of the Dutch Language Union (Nederlandse Taalunie) facilitates this co-operation. The responsible ministers decided to set up a Dutch-Flemish platform for Dutch in Human Language Technologies. The purpose of the platform is the further construction of an adequate digital language infrastructure for Dutch, so that industry develops the required applications which must guarantee that the citizens of the Netherlands and Flanders can use their own language in their communication within the information society and that the Dutch language area remains a full player in a multilingual Europe. This paper will show some of the efforts that have been undertaken. | |
Developing and Testing General Models of Spoken Dialogue System Performance | The design of methods for performance evaluation is a major open research issue in the area of spoken language dialogue systems. This paper presents the PARADISE methodology for developing predictive models of spoken dialogue performance, and shows how to evaluate the predictive power and generalizability of such models. To illustrate the methodology, we develop a number of models for predicting system usability (as measured by user satisfaction), based on the application of PARADISE to experimental data from two different spoken dialogue systems. We compare both linear and tree-based models. We then measure the extent to which the models generalize across different systems, different experimental conditions, and different user populations, by testing models trained on a subset of the corpus against a test set of dialogues. The results show that the models generalize well across the two systems, and are thus a first approximation towards a general performance model of system usability. | |
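A toy sketch of the linear variant of such a performance model: user satisfaction is regressed on a handful of per-dialogue factors by ordinary least squares. The factor values and satisfaction ratings below are invented, and PARADISE itself also normalises the factors and considers tree-based models.

```python
import numpy as np

# Hypothetical per-dialogue measurements: task success (0/1), mean recognition
# score, and number of user repetitions, plus an elicited satisfaction rating.
X = np.array([[1.0, 0.92, 1],
              [1.0, 0.85, 3],
              [0.0, 0.70, 6],
              [1.0, 0.95, 0],
              [0.0, 0.60, 8]], dtype=float)
y = np.array([4.5, 4.0, 2.0, 5.0, 1.5])   # user satisfaction ratings

# Ordinary least squares with an intercept: satisfaction ~ w0 + w . factors,
# i.e. a linear performance function in the spirit of PARADISE.
A = np.hstack([np.ones((X.shape[0], 1)), X])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and factor weights:", weights)
```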
Using Few Clues Can Compensate the Small Amount of Resources Available for Word Sense Disambiguation | Word Sense Disambiguation (WSD) is considered one of the most difficult tasks in Natural Language Processing. Probabilistic methods have shown their efficiency in many NLP tasks, but they imply a training phase, and very few resources are available for WSD. This paper aims at showing how to make the most of size-limited resources in order to partially overcome the knowledge acquisition bottleneck. Experiments are performed within the SENSEVAL test framework in order to evaluate the advantage of a lemmatized or stemmed context over an original context (inflected forms as they are observed in the raw text). Then, we measure the precision improvement (about 6%) obtained when looking at the inflected form of the word to be disambiguated. Lastly, we show that it is possible to reduce the ambiguity if the word to be disambiguated has a particular inflected form or occurs as part of a compound. | |
Modern Greek Corpus Taxonomy | The aim of this paper is to explore the way in which different kinds of linguistic variables can be used to discriminate text type in 240 pre-classified press texts. The Modern Greek (MG) language, due to its past diglossic status, exhibits extended variation in written texts across all linguistic levels, which can be exploited in text categorization tasks. The research presented uses Discriminant Function Analysis (DFA) as a text categorization method and explores the way different variable groups contribute to text type discrimination. | |
Language Resources as by-Product of Evaluation: The MULTITAG Example | In this paper, we show how the paradigm of evaluation can function as a language resource producer for high-quality and low-cost validated language resources. First the paradigm of evaluation is presented and the main points of its history are recalled, from the first deployment that took place in the USA during the DARPA/NIST evaluation campaigns, up to the latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS etc.). Then the principle behind the method used to produce high-quality validated language resources at low cost from the by-products of an evaluation campaign is exposed. It was inspired by the ROVER experiments (Recognizer Output Voting Error Reduction) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating systems with a simple voting strategy to obtain higher performance results. Here we make a link with existing strategies for system combination studied in machine learning. As an illustration, we describe how the MULTITAG project funded by CNRS has built, from the by-products of the GRACE evaluation campaign (a French Part-Of-Speech tagging system evaluation campaign), a corpus of around 1 million words annotated with a fine-grained tagset derived from the EAGLES and MULTEXT projects. A brief presentation of the state of the art in Part-Of-Speech (POS) tagging and of the problem posed by its evaluation is given at the beginning; then the corpus itself is presented along with the procedure used to produce and validate it. In particular, the cost reduction brought by using this method instead of more classical methods is presented, and its generalization to other control tasks is discussed in the conclusion. | |
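A minimal sketch of the simple voting strategy mentioned, applied to per-token POS tags produced by several systems on the same tokenised text; real combination schemes (and ROVER itself) also handle output alignment and confidence weighting, which are omitted here, and the tags below are invented.

```python
from collections import Counter

def vote_tags(system_outputs):
    """Combine several taggers' outputs on the same token sequence by per-token
    majority voting; ties are resolved by Counter ordering, i.e. in practice
    the tag of the earliest-listed system."""
    combined = []
    for token_tags in zip(*system_outputs):
        best, _ = Counter(token_tags).most_common(1)[0]
        combined.append(best)
    return combined

sys_a = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
sys_b = ["DET", "NOUN", "NOUN", "ADP", "NOUN"]
sys_c = ["DET", "ADJ",  "VERB", "ADP", "NOUN"]
print(vote_tags([sys_a, sys_b, sys_c]))   # ['DET', 'NOUN', 'VERB', 'ADP', 'NOUN']
```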
Evaluation of Computational Linguistic Techniques for Identifying Significant Topics for Browsing Applications | Evaluation of natural language processing tools and systems must focus on two complementary aspects: first, evaluation of the accuracy of the output, and second, evaluation of the functionality of the output as embedded in an application. This paper presents evaluations of two aspects of LinkIT, a tool for noun phrase identification, linking, sorting and filtering. LinkIT [Evans 1998] uses a head sorting method [Wacholder 1998] to organize and rank simplex noun phrases (SNPs). The goal of LinkIT is to identify significant topics in domain-independent documents. The first evaluation, reported in D. K. Evans et al. (2000), compares the output of the noun phrase finder in LinkIT to two other systems. Issues of establishing a gold standard and criteria for matching are discussed. The second evaluation directly concerns the construction of the browsing application. We present results from Wacholder et al. (2000) on a qualitative evaluation which compares three shallow processing methods for extracting index terms, i.e., terms that can be used to model the content of documents. We analyze both quality and coverage. We discuss how experimental results such as these guide the building of an effective browsing application. | |
Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition | This paper reports on a project for the collection of sound scene data. Such data are necessary for studies such as sound source localization, sound retrieval, sound recognition and hands-free speech recognition in real acoustical environments. There are many kinds of sound scenes in real environments; a sound scene is characterized by its sound sources and room acoustics, and the number of combinations of sound sources, source positions and rooms is huge. However, the sound in such environments can be simulated by convolving the isolated sound sources with impulse responses. As isolated sound sources, a hundred kinds of non-speech sounds as well as speech sounds have been collected, and impulse responses have been collected in various acoustical environments. In this paper, the progress of our sound scene database project and its application to environmental sound recognition are described. | |
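A minimal sketch of the simulation step described above: a dry (isolated) source is convolved with a room impulse response to obtain the in-room signal. The signals here are synthetic stand-ins for the recorded material in the database.

    import numpy as np
    from scipy.signal import fftconvolve

    fs = 16000
    rng = np.random.default_rng(0)
    dry_source = rng.standard_normal(fs)            # 1 s of an isolated source (stand-in)

    impulse_response = np.zeros(fs // 2)            # toy impulse response with 3 reflections
    impulse_response[[0, 2400, 5600]] = [1.0, 0.5, 0.25]

    simulated = fftconvolve(dry_source, impulse_response)  # simulated in-room signal
    print(simulated.shape)                          # length = len(source) + len(IR) - 1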
Using Lexical Semantic Knowledge from Machine Readable Dictionaries for Domain Independent Language Modelling | Machine Readable Dictionaries (MRDs) have been used in a variety of language processing tasks including word sense disambiguation, text segmentation, information retrieval and information extraction. In this paper we describe the utilization of semantic knowledge acquired from an MRD for language modelling tasks in relation to speech recognition applications. A semantic model of language has been derived using the dictionary definitions in order to compute the semantic association between the words. The model is capable of capturing phenomena of latent semantic dependencies between the words in texts and reducing the language ambiguity by a considerable factor. The results of experiments suggest that the semantic model can improve the word recognition rates in “noisy-channel” applications. This research provides evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling. | |
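One simple way to picture a definition-based association measure of the kind described above; this stand-in uses word overlap between dictionary definitions (Dice coefficient), which is far cruder than the model in the paper.

    # Words whose dictionary definitions share vocabulary get a higher
    # association score; the toy definitions below are invented.
    definitions = {
        "bank":  "an organization that keeps and lends money",
        "loan":  "money that is lent by a bank or another organization",
        "river": "a large natural stream of water",
    }

    def association(w1, w2):
        a, b = set(definitions[w1].split()), set(definitions[w2].split())
        return 2 * len(a & b) / (len(a) + len(b))

    print(association("bank", "loan"))    # relatively high
    print(association("bank", "river"))   # relatively low

Scores of this kind can then be used to re-rank competing word hypotheses in a "noisy-channel" setting, preferring the candidate that is most semantically associated with its context.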
Annotation of a Multichannel Noisy Speech Corpus | This paper describes the annotation of an Italian corpus of in-car speech material, with specific reference to the JavaSgram tool, developed for the purpose of annotating multichannel speech corpora. Some pre/post-processing tools used with JavaSgram are briefly described, together with a concise description of the annotation criteria that were adopted. The final objective is to use the resulting corpus for training and testing a hands-free speech recognizer under development. | |
ARISTA Generative Lexicon for Compound Greek Medical Terms | A generative lexicon for compound Greek medical terms based on the ARISTA method is proposed in this paper. Following that method, we introduce the concept of a representation-independent, definition-generating lexicon for compound words. This concept is used as a basis for developing a generative lexicon of Greek compound medical terminology using the senses of the component words expressed in natural language rather than in a formal language. A Prolog program implemented for this task is presented, capable of computing implicit relations between the component words in a sublanguage using linguistic and extra-linguistic knowledge. An extra-linguistic knowledge base containing knowledge derived from the domain or microcosm of the sublanguage is used to support the computation of the implicit relations. The performance of the system was evaluated by generating possible senses of the compound words automatically and judging the correctness of the results by comparing them with definitions given in a medical lexicon expressed in the language of the lexicographer. | |
A Self-Expanding Corpus Based on Newspapers on the Web | A Unix-based system is presented which automatically collects newspaper articles from the web, converts the texts, and includes them in a newspaper corpus. This corpus can be searched from a web browser. The corpus currently contains 70 million words and grows by 4 million words each month. | |
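A minimal sketch of the collect-convert-include cycle described above, using only the Python standard library; the URL and output file are placeholders, and real article extraction would need a proper HTML parser rather than regex stripping.

    import re
    import urllib.request

    def fetch_article_text(url):
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        text = re.sub(r"<[^>]+>", " ", html)        # crude removal of HTML markup
        return re.sub(r"\s+", " ", text).strip()

    def append_to_corpus(text, path="corpus.txt"):
        with open(path, "a", encoding="utf-8") as corpus:
            corpus.write(text + "\n")

    # append_to_corpus(fetch_article_text("https://example.org/article.html"))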
A Web-based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts | A general-purpose text corpus meant for linguists and lexicographers needs to satisfy quality criteria on at least four different levels. The first two criteria are fairly well established: the corpus should contain a wide variety of texts and be tagged according to a fine-grained system. The last two criteria are, unfortunately, much less widely appreciated. One has to do with the variety of search criteria: the user should be allowed to search for any information contained in the corpus, in any combination, and the search results should be presented in a choice of ways. The fourth criterion has to do with accessibility. It is a rather surprising fact that while user interfaces tend to be simple and self-explanatory in most areas of life represented electronically, corpus interfaces are still extremely user-unfriendly. In this paper, we present a corpus whose interface and search options we have given a lot of thought: the Oslo Corpus of Tagged Norwegian Texts. | |
COCOSDA - a Progress Report | This paper presents a review of the activities of COCOSDA, the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output. COCOSDA has a history of innovative actions which spawn national and regional consortia for the co-operative development of speech corpora and for the promotion of research in related topics. COCOSDA has recently undergone a change of organisation in order to meet the developing needs of the speech- and language-processing technologies and this paper summarises those changes. | |
The Treatment of Adjectives in SIMPLE: Theoretical Observations | This paper discusses the issues that play a part in the characterization of adjectival meaning. It describes the SIMPLE ontology for adjectives and provides insight into the morphological, syntactic and semantic aspects that are included in the SIMPLE adjectival templates. | |
Cardinal, Nominal or Ordinal Similarity Measures in Comparative Evaluation of Information Retrieval Process | Similarity measures are used to quantify the resemblance of two sets. The simplest ones are computed from ratios of the numbers of documents in the compared sets; these are usually employed in the first steps of evaluation studies and are called cardinal measures. Other measures compare sets on the basis of the documents they have in common; they are usually employed in quantitative information retrieval evaluations, examples being Jaccard, Cosine, Recall and Precision. These measures are called nominal ones. They are more or less suitable depending on the richness of the information system's answer. In the past they were sufficient, because the answers given by systems consisted only of an unordered set of documents. But current systems improve the quality or the visibility of their answers by using relevance ranking or a clustered presentation of documents, and in this case such similarity measures are no longer adequate. In this paper we present some solutions for the cases of totally ordered and partially ordered answers. | |
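The nominal measures named above can be written down directly for two answer sets (here, sets of document identifiers); for precision and recall the second argument plays the role of the relevant (reference) set.

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def cosine(a, b):
        return len(a & b) / (len(a) * len(b)) ** 0.5

    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    A, B = {1, 2, 3, 4}, {3, 4, 5}
    print(jaccard(A, B), cosine(A, B), precision(A, B), recall(A, B))

As the abstract points out, once the system's answer is a ranking or a clustering rather than a flat set, these set-based measures no longer apply directly, which motivates the ordinal and partial-order variants discussed in the paper.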
Evaluating Multi-party Multi-modal Systems | The MITRE Corporation's Evaluation Working Group has developed a methodology for evaluating multi-modal groupware systems and capturing data on human-human interactions. The methodology consists of a framework for describing collaborative systems, a scenario-based evaluation approach, and evaluation metrics for the various components of collaborative systems. We designed and ran two sets of experiments to validate the methodology by evaluating collaborative systems. In one experiment, we compared two configurations of a multi-modal collaborative application using a map navigation scenario requiring information sharing and decision making. In the second experiment, we applied the evaluation methodology to a loosely integrated set of collaborative tools, again using a scenario-based approach. In both experiments, multi-modal, multi-user data were collected, visualized, annotated, and analyzed. | |
Extension and Use of GermaNet, a Lexical-Semantic Database | This paper describes GermaNet, a lexical-semantic network and on-line thesaurus for the German language, and outlines its future extension and use. GermaNet is structured along the same lines as the Princeton WordNet (Miller et al., 1990; Fellbaum, 1998), encoding the major semantic relations like synonymy, hyponymy, meronymy, etc. that hold among lexical items. Constructing semantic networks like GermaNet has become very popular in recent approaches to computational lexicography, since wordnets constitute important language resources for word sense disambiguation, which is a prerequisite for various applications in the field of natural language processing, like information retrieval, machine translation and the development of different language-learning tools. | |
Russian Monitor Corpora: Composition, Linguistic Encoding and Internet Publication | |
An Open Source Grammar Development Environment and Broad-coverage English Grammar Using HPSG | The LinGO (Linguistic Grammars Online) project's English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see http://lingo.stanford.edu). Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels. | |
Hua Yu: A Word-segmented and Part-Of-Speech Tagged Chinese Corpus | As the outcome of a three-year joint effort of the Department of Computer Science, Tsinghua University, and the Language Information Processing Institute, Beijing Language and Culture University, Beijing, China, a word-segmented and part-of-speech tagged Chinese corpus of 2 million Chinese characters, named HuaYu, has been established. This paper first briefly introduces some basics about HuaYu, such as its genre distribution, the fundamental considerations in designing it, and its word segmentation and part-of-speech tagging standards. Then the complete tag set used in HuaYu is given, along with typical examples for each tag. Several pieces of annotated text in each genre are also included for the reader's reference. | |
SPEECHDAT-CAR. A Large Speech Database for Automotive Environments | The aims of the SpeechDat-Car project are to develop a set of speech databases to support training and testing of multilingual speech recognition applications in the car environment. As a result, a total of ten (10) equivalent and similar resources will be created. The 10 languages are Danish, British English, Finnish, Flemish/Dutch, French, German, Greek, Italian, Spanish and American English. For each language 600 sessions will be recorded (from at least 300 speakers) in seven characteristic environments (low speed, high speed with audio equipment on, etc.). This paper gives an overview of the project with a focus on the production phases (recording platforms, speaker recruitment, annotation and distribution). | |
Addizionario: an Interactive Hypermedia Tool for Language Learning | In this paper we present the hypermedia linguistic laboratory ''Addizionario'', an open and flexible software tool aimed at studying Italian either as native or as foreign language. The product is directed to various categories of users: school children who can perform in a pleasant and appealing manner various tasks generally considered difficult and boring, such as dictionary look-up, word definition and vocabulary expansion; teachers who can use it to prepare didactic units specifically designed to meet the needs of their students; psychologists and therapists who can use it as an aid to detect impaired development and learning in the child; and editors of children’s dictionaries who can access large quantities of material for the creation of attractive, easy-to-use tools which take into account the capacities, tastes and interests of their users. | |
Recent Developments within the European Language Resources Association (ELRA) | The main, and most visible, achievement of ELRA is the growth of its catalogue. The ELRA catalogue as of April 2000 lists 111 speech resources, 50 monolingual lexica, 113 multilingual lexica, 24 written corpora and 275 terminological databases. However, many Language Resources (LRs) still need to be identified and/or produced. To this effect, ELRA is active in promoting and funding the co-production of new LRs through several calls for proposals. As for the validity of ELRA's role in the distribution of language resources, the statistics from the past two years speak for themselves. The 1999 fiscal report showed a rise, with the sale of 217 LRs (122 for research and 95 for commercial purposes; with speech databases representing nearly 45%), compared to the sale of 180 LRs (90 for research and 90 for commercial purposes; with speech databases representing nearly 65%) in 1998, and to 33 sold in 1997. The other visible action of ELRA is its membership drive: since its foundation, ELRA has attracted an increasing number of members (from 63 in 1995 to 95 in 1999). This article is updated from a paper presented at Eurospeech'99. |