LREC 2000 2nd International Conference on Language Resources & Evaluation
Title | Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model |
Authors | Shirai Kiyoaki (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, kshirai@cl.cs.titech.ac.jp) Tanaka Hozumi (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, tanaka@cl.cs.titech.ac.jp) Tokunaga Takenobu (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, take@cl.cs.titech.ac.jp) |
Keywords | Human Intervention, Iterative Learning, Statistical Language Model, Tree-Annotated Corpus |
Session | Session WP2 - Corpus Annotation |
Full Paper | 341.ps, 341.pdf |
Abstract | In this paper, we propose a method for constructing a tree-annotated corpus when a statistical parsing system exists but no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and to iteratively retrain the statistical language model on the annotated trees obtained so far. The major characteristics of our method are as follows: (1) in the first step of the iterative learning process, we manually construct a tree-annotated corpus on which to initialize the statistical language model, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources to choose the most probable parse tree. |
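The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tree representation (a tuple of rule strings), the add-one-smoothed rule frequencies standing in for the syntactic model, and the names `train_syntactic_model`, `choose_tree`, `build_corpus`, and `lex_logprob` are all assumptions made for illustration. The parser that proposes candidate trees and the lexical statistics are assumed to be supplied from outside.

```python
from collections import Counter
import math


def train_syntactic_model(trees):
    """Return a rule log-probability function estimated from annotated trees.

    Assumption: a tree is a tuple of rule strings, and the "model" is a crude
    add-one-smoothed relative frequency over all rules, not a proper PCFG.
    """
    counts = Counter(rule for tree in trees for rule in tree)
    total = sum(counts.values())
    slots = len(counts) + 1  # one extra slot reserved for unseen rules

    def logprob(rule):
        return math.log((counts[rule] + 1) / (total + slots))

    return logprob


def choose_tree(candidates, syn_logprob, lex_logprob):
    """Pick the candidate with the highest combined syntactic + lexical score."""
    def score(tree):
        return sum(syn_logprob(rule) for rule in tree) + lex_logprob(tree)

    return max(candidates, key=score)


def build_corpus(seed_trees, batches, lex_logprob):
    """Grow a corpus from manually annotated seed trees.

    `batches` is a list of batches; each batch is a list of sentences, and each
    sentence is given as its list of candidate trees (hypothetical parser output).
    The syntactic model is retrained before each batch; the lexical scorer is fixed.
    """
    corpus = list(seed_trees)
    for batch in batches:
        syn_logprob = train_syntactic_model(corpus)  # iterative retraining step
        for candidates in batch:
            corpus.append(choose_tree(candidates, syn_logprob, lex_logprob))
    return corpus


# Toy usage: one hand-annotated seed tree, one new sentence with two candidates.
seed = [("S->NP VP", "NP->N", "VP->V NP", "NP->N")]
cand_a = ("S->NP VP", "NP->N", "VP->V")
cand_b = ("S->VP NP", "NP->N", "VP->V")
corpus = build_corpus(seed, [[[cand_a, cand_b]]], lex_logprob=lambda tree: 0.0)
```

With a neutral lexical score, the candidate whose rules were already seen in the seed corpus (`cand_a`) is selected and added, so later batches are scored by a model that reflects it. In a setting closer to the abstract, `lex_logprob` would be derived from existing language resources, and a human could review the selected trees before they are added.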