LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model
Authors Shirai Kiyoaki (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, kshirai@cl.cs.titech.ac.jp)
Tanaka Hozumi (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, tanaka@cl.cs.titech.ac.jp)
Tokunaga Takenobu (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, take@cl.cs.titech.ac.jp)
Keywords Human Intervention, Iterative Learning, Statistical Language Model, Tree-Annotated Corpus
Session Session WP2 - Corpus Annotation
Full Paper 341.ps, 341.pdf
Abstract In this paper, we propose a method for constructing a tree-annotated corpus when a statistical parsing system exists but no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and to iteratively retrain the statistical language model on the annotated trees obtained. The major characteristics of our method are as follows: (1) in the first step of the iterative learning process, we manually construct a tree-annotated corpus with which to initialize the statistical language model, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources to choose the most probable parse tree.
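
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a tree is represented simply as a tuple of rule strings, the syntactic model is an add-alpha smoothed relative frequency over rules (a simplification of a real statistical language model), and the names `train`, `syntactic_logprob`, `iterative_annotation`, `candidates_for`, and `lexical_logprob` are all hypothetical. The parser is stood in for by `candidates_for`, which is assumed to return the candidate trees for a sentence, and `lexical_logprob` stands in for the lexical statistics pre-derived from existing language resources.

```python
import math
from collections import Counter


def train(trees):
    """Count rule occurrences over annotated trees (tree = tuple of rule strings)."""
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts


def syntactic_logprob(tree, counts, alpha=1.0):
    """Add-alpha smoothed log-probability of a tree (simplified, assumed model)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen rules
    return sum(
        math.log((counts[rule] + alpha) / (total + alpha * vocab))
        for rule in tree
    )


def iterative_annotation(seed_trees, batches, candidates_for, lexical_logprob):
    """Annotate batches of sentences, retraining the syntactic model after each batch.

    seed_trees      -- manually constructed trees that initialize the model
    batches         -- iterable of lists of plain-text sentences
    candidates_for  -- hypothetical parser hook: sentence -> list of candidate trees
    lexical_logprob -- hypothetical hook: (sentence, tree) -> lexical log-score
    """
    corpus = list(seed_trees)
    counts = train(corpus)  # step 1: initialize from the manual corpus
    for batch in batches:
        for sentence in batch:
            # Choose the most probable tree using syntactic and lexical statistics.
            best = max(
                candidates_for(sentence),
                key=lambda t: syntactic_logprob(t, counts)
                + lexical_logprob(sentence, t),
            )
            corpus.append(best)
        counts = train(corpus)  # retrain on the enlarged corpus
    return corpus
```

In this sketch the two sources of statistics are combined by adding log-scores, which is an assumption made for illustration; the abstract does not specify how the syntactic and lexical statistics are combined.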