LREC 2000 2nd International Conference on Language Resources & Evaluation
Conference Papers and Abstracts
Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus | This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high-granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.
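The core of the HMM tagging approach mentioned in this abstract can be sketched as follows. This is a minimal illustrative bigram Viterbi decoder with hand-set toy probabilities, not the project's actual implementation; all names, tags and probability values here are invented for the example.

```python
# Minimal sketch of bigram HMM tagging by Viterbi decoding.
# All probabilities below are illustrative toy values, not trained estimates.

def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence for `words`."""
    # best[t] = (probability, path) for tag sequences ending in tag t
    best = {t: (start.get(t, 1e-9) * emit.get((t, words[0]), 1e-9), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, path = max(
                ((best[s][0] * trans.get((s, t), 1e-9)
                  * emit.get((t, w), 1e-9), best[s][1]) for s in tags),
                key=lambda x: x[0])
            new[t] = (p, path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Toy model: a determiner followed by a noun (Dutch "de hond", 'the dog')
tags = ["DET", "N"]
start = {"DET": 0.9, "N": 0.1}
trans = {("DET", "N"): 0.9, ("N", "DET"): 0.3,
         ("DET", "DET"): 0.1, ("N", "N"): 0.7}
emit = {("DET", "de"): 0.8, ("N", "hond"): 0.6}
print(viterbi(["de", "hond"], tags, trans, emit, start))  # ['DET', 'N']
```

In practice a tagger generator estimates `start`, `trans` and `emit` from an annotated sample, which is why tagger quality on small training samples matters, as the abstract reports.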
Perception and Analysis of a Reiterant Speech Paradigm: a Functional Diagnostic of Synthetic Prosody | A set of perception experiments using reiterant speech was designed to carry out a diagnostic of the segmentation/hierarchisation linguistic function of prosody. The prosodic parameters of F0, syllabic duration and intensity of the stimuli used during this experiment were extracted. Several dissimilarity measures (correlation, root-mean-square distance and mutual information) were used to match the results of the subjective experiment. This comparison of the listeners' perception with acoustic parameters is intended to underline the acoustic cues used by listeners to judge the adequacy of prosody to perform a given linguistic function.
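The three dissimilarity measures this abstract names can be sketched on toy per-syllable prosodic vectors as below. The data values, the bin count, and the histogram-based mutual-information estimate are all illustrative assumptions, not the authors' actual computation.

```python
# Sketch of the three dissimilarity measures named in the abstract,
# applied to invented per-syllable F0 contours (Hz).
import math

def correlation(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def rms_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def mutual_information(x, y, bins=2):
    # crude histogram-based estimate on discretised values (assumption)
    def disc(v):
        lo, hi = min(v), max(v)
        return [min(int((a - lo) / (hi - lo + 1e-9) * bins), bins - 1)
                for a in v]
    dx, dy = disc(x), disc(y)
    n = len(x)
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            pxy = sum(1 for a, b in zip(dx, dy) if a == i and b == j) / n
            px, py = dx.count(i) / n, dy.count(j) / n
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    return mi

f0_a = [110.0, 150.0, 120.0, 180.0]  # hypothetical stimulus contour
f0_b = [112.0, 149.0, 125.0, 175.0]  # hypothetical perceived contour
print(correlation(f0_a, f0_b))
print(rms_distance(f0_a, f0_b))
print(mutual_information(f0_a, f0_b))
```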
Perceptual Evaluation of a New Subband Low Bit Rate Speech Compression System based on Waveform Vector Quantization and SVD Postfiltering | This paper proposes a new low rate speech coding algorithm, based on a subband approach. At first, a frame of the incoming signal is fed to a low pass filter, thus yielding the low frequency (LF) part. By subtracting the latter from the incoming signal the high frequency (HF), non-smoothed part is obtained. The HF part is modeled using waveform vector quantisation (VQ), while the LF part is modeled using a spectral estimation method based on a Hankel matrix, its shift invariant property and SVD, called CSE. At the receiver side an adaptive postfiltering based on SVD is performed for the HF part, and a simple resynthesis for the LF part, before the two components are added in order to produce the reconstructed signal. Progressive speech compression (variable degree of analysis/synthesis at transmitter/receiver) is thus possible, resulting in a variable bit rate scheme. The new method is compared to the CELP algorithm at 4800 bps and is shown to be of similar quality, in terms of intelligibility and segmental SNR. Moreover, perceptual evaluation tests of the new method were conducted for different bit rates down to 1200 bps, and the majority of the evaluators indicated that the technique provides intelligible reconstruction.
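The subband split described at the start of this abstract can be sketched as below. The moving-average filter is a stand-in assumption for whatever low-pass filter the paper actually uses, and the VQ/CSE modeling stages are not shown; the sketch only illustrates that LF + HF recovers the frame.

```python
# Sketch of the LF/HF subband split: low-pass the frame to get the LF part,
# subtract it from the signal to get the HF part. The moving-average filter
# is an illustrative stand-in, not the paper's filter.

def low_pass(frame, k=3):
    """Simple moving-average low-pass filter (illustrative only)."""
    half = k // 2
    out = []
    for i in range(len(frame)):
        window = frame[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
lf = low_pass(frame)                      # smoothed, low-frequency part
hf = [s - l for s, l in zip(frame, lf)]   # residual, high-frequency part
# at the receiver, adding the decoded parts reconstructs the frame
rec = [l + h for l, h in zip(lf, hf)]
assert all(abs(r - s) < 1e-12 for r, s in zip(rec, frame))
```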
Perceptual Evaluation of Text-to-Speech Implementation of Enclitic Stress in Greek | This paper presents a perceptual evaluation of a text to speech (TTS) synthesizer in Greek with respect to acoustic registration of enclitic stress and related naturalness and intelligibility. Based on acoustical measurements and observations of naturally recorded utterances, the corresponding output of a commercially available formant-based speech synthesizer was altered and the results were subjected to perceptual evaluation. Pitch curve, intensity, and duration of the syllable bearing enclitic stress were acoustically manipulated, while a phonetically identical phrase contrasting only in stress served as control stimulus. Ten listeners judged the perceived naturalness and preference (in pairs) and the stress pattern of each variant of a base phrase. It was found that intensity modification adversely affected perceived naturalness while increasing perceived stress prominence. Duration modification had no appreciable effect. Pitch curve modification tended to produce an improvement in perceived naturalness and preference, but the results failed to achieve statistical significance. The results indicated that the current prosodic module of the speech synthesizer reflects a good balance between prominence of stress assignment, intelligibility, and naturalness.
PLEDIT - A New Efficient Tool for Management of Multilingual Pronunciation Lexica and Batchlists | The program tool PLEDIT - Pronunciation Lexica Editor - has been created for efficient handling of pronunciation lexica and batchlists. PLEDIT is designed as a GUI, which incorporates tools for fast and efficient management of pronunciation lexica and batchlists. The tool is written in Tcl/Tk/Tix and can thus be easily ported to different platforms. PLEDIT supports three lexicon format types: the Siemens, SpeechDat and CMU lexicon formats. PLEDIT enables full editing capability for lexica and batchlists and supports work with multilingual resources. Some functions have been built in as external programs written in the C programming language; these external programs give PLEDIT higher speed and efficiency.
Portuguese Corpora at CLUL | The Corpus de Referencia do Portugues Contemporaneo (CRPC) has been under development at the Centro de Linguistica da Universidade de Lisboa (CLUL) since 1988, with a view to enlarging research data, in the sense of verifying concepts and hypotheses while rejecting the sole use of intuitive data. The intention of creating this open corpus is to establish an on-line representative sample collection of general-usage contemporary Portuguese: a main corpus of great dimension as well as several specialized corpora. The CRPC currently contains around 92 million words. In line with common practice in this area, the CRPC project intends to establish a linguistic database accessible to everyone interested in carrying out theoretical and practical studies or applications. The dialectal oral corpus of the Atlas Linguistico-Etnografico de Portugal e da Galiza (ALEPG) consists of approximately 3500 hours of speech collected by the CLUL Dialectal Studies Research Group and recorded on analogue audio tape. This corpus contains mainly directed speech: answers to an essentially lexical linguistic questionnaire that also covers some phonetic and morpho-phonological phenomena. A substantial amount of spontaneous speech also enables other kinds of studies, such as syntactic, morphological or phonetic ones.
PoS Disambiguation and Partial Parsing Bidirectional Interaction | This paper presents Latch, a system for PoS disambiguation and partial parsing that has been developed for Spanish. In this system, chunks can be recognized and referred to like ordinary words in the disambiguation process. In this way, sentences are simplified so that the disambiguator can operate by interpreting a chunk as a word and chunk-head information as a word analysis. This interaction of PoS disambiguation and partial parsing considerably reduces the effort needed for writing rules. Furthermore, the methodology we propose improves both efficiency and results.
POSCAT: A Morpheme-based Speech Corpus Annotation Tool | As more and more speech systems require linguistic knowledge to accommodate various levels of applications, corpora that are tagged with linguistic annotations as well as signal-level annotations are highly recommended for the development of today’s speech systems. Among the linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech corpora for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. Considering the above demands, we have developed a single unified speech corpus annotation tool that enables corpus builders to link linguistic annotations to signal-level annotations using a morphological analyzer and a POS tagger as basic morpheme-based linguistic engines. Our tool integrates a syntactic analyzer, phrase break detector, grapheme-to-phoneme converter and automatic phonetic aligner together. Each engine automatically annotates its own linguistic and signal knowledge, and interacts with the corpus developers to revise and correct the annotations on demand. All the linguistic/phonetic engines were developed and merged with an interactive visualization tool in a client-server network communication model. The corpora that can be constructed using our annotation tool are multi-purpose and applicable to both speech recognition and text-to-speech (TTS) systems. Finally, since the linguistic and signal processing engines and user interactive visualization tool are implemented within a client-server model, the system loads can be reasonably distributed over several machines.
Predictive Performance of Dialog Systems | This paper describes some of our experiments on predictive performance measures for dialog systems. Evaluating dialog systems is often a very costly procedure, owing to the need to carry out user trials. Obviously it is advantageous when evaluation can be carried out automatically. It would be helpful if, for each application, we were able to measure system performance with an objective cost function. This performance function can then be used to make predictions about a future evolution of the system without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (a kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI on a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and the number of user repetitions. Furthermore, similar spoken dialog features appear as important features for the Arise system (a train timetable telephone information system). We also explore different ways of measuring user satisfaction, and then discuss the introduction of subjective factors in the predictive coefficients.
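The PARADISE-style performance function this abstract relies on can be sketched as a weighted sum of normalised quality and cost factors. The factor values and weights below are illustrative assumptions, not the coefficients estimated in the paper; only the overall shape (z-scored predictors, positive weights for accuracy, negative for repetitions) follows the approach described.

```python
# Sketch of a PARADISE-style performance function: user satisfaction is
# predicted as a weighted sum of z-scored dialog factors.
# All factor values and weights are hypothetical.

def zscore(values):
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / sd for v in values]

def performance(factors, weights):
    """factors: dict name -> list of per-dialog z-scored values."""
    n = len(next(iter(factors.values())))
    return [sum(weights[k] * factors[k][i] for k in weights)
            for i in range(n)]

# Four hypothetical dialogs with the three predictors the paper identifies
understanding = zscore([0.9, 0.7, 0.95, 0.6])   # understanding accuracy
recognition = zscore([0.85, 0.6, 0.9, 0.5])     # recognition accuracy
repetitions = zscore([1, 4, 0, 6])              # user repetitions (a cost)

perf = performance(
    {"und": understanding, "rec": recognition, "rep": repetitions},
    {"und": 0.5, "rec": 0.3, "rep": -0.2},      # hypothetical weights
)
print(perf)  # dialog 2 (best accuracy, no repetitions) scores highest
```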
Principled Hidden Tagset Design for Tiered Tagging of Hungarian | For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or more distinct codes. For the purpose of automatic disambiguation of arbitrary written texts, using such a large tagset would raise many problems, from the implementation issues of making a tagger work with it to the more theoretical difficulty of sparseness of training data. Tiered tagging is one way to alleviate this problem by reformulating it as follows: starting from a large set of MSDs, design a reduced tagset, the C-tagset, manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD set has a cardinality of several thousand. This means that designing a manageable C-tagset calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features.
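The tagset-reduction idea behind tiered tagging can be sketched as a projection that keeps only selected feature positions of each full MSD code. The feature layout, the codes, and which positions are kept per part of speech are all invented for illustration; the paper's point is that choosing these retained features well is the hard design problem.

```python
# Sketch of MSD -> C-tag reduction for tiered tagging.
# The MSD layout (POS, Case, Number, Person, Definiteness) and the
# retained positions are hypothetical examples, not the paper's design.

KEEP = {"Noun": (0, 1, 2),   # keep POS, Case, Number for nouns
        "Verb": (0, 3)}      # keep POS, Person for verbs

def to_ctag(msd):
    """Project a full MSD code onto its reduced C-tag."""
    feats = msd.split("+")
    keep = KEEP.get(feats[0], (0,))   # unknown POS: keep POS only
    return "+".join(feats[i] for i in keep if i < len(feats))

print(to_ctag("Noun+Acc+Plur+3+Def"))    # Noun+Acc+Plur
print(to_ctag("Verb+Nom+Sing+1+Indef"))  # Verb+1
```

Because many MSDs collapse onto one C-tag, the tagger works with a tractable tagset, and a second tier later recovers the dropped features, typically from the lexicon and local context.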
Producing LRs in Parallel with Lexicographic Description: the DCC project | This paper is a brief presentation of some aspects of the most important lexicographical project being carried out in Catalonia: the DCC (Dictionary of Contemporary Catalan) project. After giving a general description of the aims of the project, the specific goal of my contribution is to present the general strategy of our lexicographical description, consisting of the production of an electronic dictionary that can serve as the common repository from which we will obtain different derived products (among them, the human dictionary). My concern is to show to what extent human and computer lexicography can share descriptions, and how the results of lexicographic work can be taken as a language resource in this new perspective. I will present different aspects and criteria of our dictionary, taking the different layers (morphology, syntax, semantics) as a guideline.
Production of NLP-oriented Bilingual Language Resources from Human-oriented Dictionaries | In this paper, the main features of manually produced bilingual dictionaries, originally designed for human use, are considered. The problem is to find a way to use such dictionaries to produce bilingual language resources that could form a basis for automated text processing, such as machine translation, cross-lingual querying in text retrieval, etc. The transformation technology suggested here is based on XML parsing of the file obtained from the source data by means of a series of special procedures. Automatic procedures suffice to produce a well-formed XML file, but in most cases there remain semantic problems and inconsistencies that can be resolved only interactively. However, the volume of this work can be minimized thanks to automatic pre-editing and suitable XML mark-up. The paper presents the results of an R&D project carried out in the framework of the ELRA 1999 Call for Proposals on Language Resources Production. The paper is based on the authors’ experience with English-Russian and French-Russian dictionaries, but the technology can be applied to other pairs of languages.
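The kind of transformation this abstract describes can be sketched as below: a flat, human-oriented dictionary line is turned into well-formed XML. The input format and the element names (`entry`, `hw`, `trans`) are invented for illustration and are not the project's actual mark-up scheme.

```python
# Sketch: convert a flat bilingual dictionary line into well-formed XML.
# The "headword : translation; translation" format and the element names
# are hypothetical, for illustration only.
import xml.etree.ElementTree as ET

def entry_to_xml(line):
    headword, translations = line.split(" : ")
    entry = ET.Element("entry")
    ET.SubElement(entry, "hw").text = headword.strip()
    for t in translations.split(";"):
        ET.SubElement(entry, "trans").text = t.strip()
    return ET.tostring(entry, encoding="unicode")

print(entry_to_xml("house : maison; demeure"))
# <entry><hw>house</hw><trans>maison</trans><trans>demeure</trans></entry>
```

As the abstract notes, such purely automatic conversion yields well-formed XML but cannot decide semantic questions (e.g. whether a translation is a variant or a distinct sense); those cases need interactive post-editing.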
Providing Internet Access to Portuguese Corpora: the AC/DC Project | In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do portugues) as regards providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilizacao de Corpora, roughly ''Access and Availability of Corpora''), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still being regularly improved, we focus on the process of tagging and parsing the underlying corpora using a Constraint Grammar parser for Portuguese.