LREC 2010 Proceedings

Summary of the paper

Title	Efficiently Extract Recurring Tree Fragments from Large Treebanks
Authors	Federico Sangati, Willem Zuidema and Rens Bod
Abstract	In this paper we describe FragmentSeeker, a tool capable of identifying all tree constructions that recur multiple times in a large phrase-structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees in a given treebank and computes the list of fragments they share. We describe the two notions of fragment we use, i.e. standard and partial fragments, and provide implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank, for which we present quantitative and qualitative analyses of the recurring structures obtained, and report empirical time performance. Finally, we propose ways in which our tool could contribute to different research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.
Topics	Tools, systems, applications, Grammar and Syntax, Corpus (creation, annotation, etc.)
Full paper	Efficiently Extract Recurring Tree Fragments from Large Treebanks
Slides	-
Bibtex	@InProceedings{SANGATI10.613, author = {Federico Sangati and Willem Zuidema and Rens Bod}, title = {Efficiently Extract Recurring Tree Fragments from Large Treebanks}, booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, year = {2010}, month = {may}, date = {19-21}, address = {Valletta, Malta}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-6-7}, language = {english} }