LREC 2000 2nd International Conference on Language Resources & Evaluation |
Title | LT TTT - A Flexible Tokenisation Tool |
Authors | Grover Claire (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: grover@cogsci.ed.ac.uk); Matheson Colin (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: colin@cogsci.ed.ac.uk); Mikheev Andrei (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: mikheev@cogsci.ed.ac.uk); Moens Marc (Language Technology Group, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, email: marcg@cogsci.ed.ac.uk) |
Keywords | Corpus Preparation, Information Extraction, Named Entity Recognition, Tokenisation, XML Mark-Up |
Session | Session WP6 - Tools in the Written Area |
Full Paper | 93.ps, 93.pdf |
Abstract | We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up and the preparation |