LREC 2000 2nd International Conference on Language Resources & Evaluation
 

Title LT TTT - A Flexible Tokenisation Tool
Authors Grover Claire (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: grover@cogsci.ed.ac.uk)
Matheson Colin (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: colin@cogsci.ed.ac.uk)
Mikheev Andrei (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: mikheev@cogsci.ed.ac.uk)
Moens Marc (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: marcg@cogsci.ed.ac.uk)
Keywords Corpus Preparation, Information Extraction, Named Entity Recognition, Tokenisation, XML Mark-Up
Session Session WP6 - Tools in the Written Area
Full Paper 93.ps, 93.pdf
Abstract We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation