LREC 2000 - Papers

LREC 2000, 2nd International Conference on Language Resources & Evaluation

Title A Flexible Infrastructure for Large Monolingual Corpora

Authors Quasthoff Uwe (Leipzig University, Computer Science Institute, NLP Dept., Augustusplatz 10/11, 04109 Leipzig, Germany, quasthoff@informatik.uni-leipzig.de)
Wolff Christian (Leipzig University, Computer Science Institute, NLP Dept., Augustusplatz 10/11, 04109 Leipzig, Germany, wolff@informatik.uni-leipzig.de)

Keywords Collocations, Information Extraction, Monolingual Corpora, Web Search

Session Session WO6 - Acquisition of Lexical Information

Abstract In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations are discussed in detail. Finally, we give an overview of different applications for this language resource. A WWW interface allows for public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).
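As a rough illustration of what "sentence-based word collocations" can mean in practice, the following minimal sketch splits a text into sentences and counts how often two words occur in the same sentence. It is not the authors' implementation: the function names `split_sentences` and `sentence_cooccurrences`, the naive punctuation-based segmentation rule, and the use of raw co-occurrence counts (rather than a statistical significance measure) are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive segmentation (assumption): split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_cooccurrences(sentences):
    """Count, for each unordered word pair, the number of sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and deduplicate so each pair counts once per sentence.
        words = sorted(set(re.findall(r"\w+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical sample text, used only to exercise the functions.
text = "The corpus grows daily. The corpus is large. Queries run on the corpus."
pairs = sentence_cooccurrences(split_sentences(text))
print(pairs.most_common(3))
```

A real system would replace the raw counts with a significance score and a proper sentence segmenter; the sketch only shows the sentence as the unit of co-occurrence.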