Title |
A High Recall Error Identification Tool for Hindi Treebank Validation |
Authors |
Bharat Ram Ambati, Mridul Gupta, Samar Husain and Dipti Misra Sharma |
Abstract |
This paper describes the development of a hybrid tool for semi-automated validation of treebank annotation at various levels. The tool is developed for error detection at the part-of-speech, chunk and dependency levels of a Hindi treebank that is currently under development. The tool aims to identify as many errors as possible at these levels to achieve consistency in annotation. Consistency in treebank annotation is a must for making data as error-free as possible and for providing quality assurance. The tool is aimed at ensuring consistency and making manual validation cost-effective. We discuss a rule-based and a hybrid approach (statistical methods combined with rule-based methods) by which a high-recall system can be developed and used to identify errors in the treebank. We report some results of using the tool on a sample of data extracted from the Hindi treebank. We also argue that the tool can prove useful in improving the annotation guidelines, which would in turn improve the quality of annotation in subsequent iterations. |
Topics |
Validation of LRs, Corpus (creation, annotation, etc.), Standards for LRs |
Full paper |
A High Recall Error Identification Tool for Hindi Treebank Validation |
Slides |
- |
Bibtex |
@InProceedings{AMBATI10.673,
  author = {Bharat Ram Ambati and Mridul Gupta and Samar Husain and Dipti Misra Sharma},
  title = {A High Recall Error Identification Tool for Hindi Treebank Validation},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |