LREC 2000 2nd International Conference on Language Resources & Evaluation
Conference Papers
Title | Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts |
Authors |
Cieri Christopher (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, ccieri@ldc.upenn.edu); Graff David (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, graff@ldc.upenn.edu); Liberman Mark (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, myl@ldc.upenn.edu); Martey Nii (Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104, USA, nmartey@ldc.upenn.edu); Strassel Stephanie (Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104, USA, strassel@ldc.upenn.edu) |
Keywords | Annotation, Data Collection and Distribution, Information Retrieval, Language Resources, Topic Detection and Tracking |
Session | Session SP3 - Spoken Language Resources' Projects |
Abstract | This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking (TDT) project. The research goal of the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segmentation of the news streams into individual stories, detection of new topics, identification of the first story to discuss any topic, tracking of all stories on selected topics, and detection of links among stories discussing the same topics. The corpora contain English and Chinese broadcast television and radio, newswires, and text from web sites devoted to news. For each source there are texts or text intermediaries; for the broadcast stories the audio is also available. Each broadcast is also segmented to show the start and end times of all news stories. LDC staff have defined news topics in the corpora and annotated each story to indicate its relevance to each topic. The end products are massive, richly annotated corpora available to support research and development in information retrieval, topic detection and tracking, information extraction, and message understanding, either directly or after additional annotation. This paper describes the corpora created for TDT, including sources, collection processes, formats, topic selection and definition, annotation, distribution, and project management for large corpora. |