Title |
Word-based Partial Annotation for Efficient Corpus Construction |
Authors |
Graham Neubig and Shinsuke Mori |
Abstract |
In order to utilize the corpus-based techniques that have proven effective in natural language processing in recent years, costly and time-consuming manual creation of linguistic resources is often necessary. Traditionally, these resources are created at the document or sentence level. In this paper, we examine the benefit of annotating only particular words with high information content, as opposed to the entire sentence or document. Using the task of Japanese pronunciation estimation as an example, we devise a machine learning method that can be trained on data annotated word-by-word. This is done by dividing the estimation process into two steps (word segmentation and word-based pronunciation estimation), and introducing a point-wise estimator that is able to make each decision independently of the other decisions made for a particular sentence. In an evaluation, the proposed strategy is shown to provide greater increases in accuracy using a smaller number of annotated words than traditional sentence-based annotation techniques. |
Topics |
Corpus (creation, annotation, etc.), Statistical and machine learning methods, Tools, systems, applications |
Full paper |
Word-based Partial Annotation for Efficient Corpus Construction |
Slides |
- |
Bibtex |
@InProceedings{NEUBIG10.408,
  author = {Graham Neubig and Shinsuke Mori},
  title = {Word-based Partial Annotation for Efficient Corpus Construction},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |