Title |
Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora |
Authors |
Nelly Barbot, Olivier Boeffard and Arnaud Delhay |
Abstract |
Set covering algorithms are efficient tools for the optimal reduction of a linguistic corpus. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach and a Lagrangian relaxation based one giving importance to rare events, and a third one considering the Kullback-Leibler divergence between a reference distribution and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective at compressing a corpus while guaranteeing a minimal content. The variant which minimises the Kullback-Leibler divergence yields, as expected, a distribution of events close to the reference distribution; however, the price for this solution is a much larger corpus. In the proposed experiments, we have also evaluated a mixed approach that adds a random complement to the smallest coverings. |
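The two ideas named in the abstract, a greedy covering that favours rare events and a Kullback-Leibler comparison against a reference distribution, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `greedy_cover`, `unit_counts` and `kl_divergence`, the inverse-frequency scoring, the "cover each unit at least once" constraint, and the direction of the divergence (reference against reduced corpus) are all assumptions made for illustration.

```python
import math
from collections import Counter


def greedy_cover(sentences):
    """Greedily pick sentence indices until every unit is covered once.

    Assumption: a sentence's score is the sum of 1/frequency over the
    still-uncovered units it contains, so rare units weigh more.
    """
    unit_sets = [set(s) for s in sentences]
    # Number of sentences in which each unit occurs.
    freq = Counter(u for units in unit_sets for u in units)
    uncovered = set(freq)
    chosen = []
    while uncovered:
        best = max(
            range(len(unit_sets)),
            key=lambda i: sum(1.0 / freq[u] for u in unit_sets[i] & uncovered),
        )
        chosen.append(best)
        uncovered -= unit_sets[best]
    return chosen


def unit_counts(sentences, indices=None):
    """Count unit occurrences over all sentences or over a selection."""
    if indices is None:
        indices = range(len(sentences))
    return Counter(u for i in indices for u in sentences[i])


def kl_divergence(p_counts, q_counts):
    """D(P || Q) in nats for two count tables; inf if Q misses a unit of P."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for unit, count in p_counts.items():
        if count == 0:
            continue
        q_count = q_counts.get(unit, 0)
        if q_count == 0:
            return math.inf
        p = count / p_total
        q = q_count / q_total
        total += p * math.log(p / q)
    return total


if __name__ == "__main__":
    # Toy corpus: each sentence is a list of hypothetical unit labels.
    corpus = [["a", "b"], ["b", "c"], ["a", "b", "c", "d"], ["d"]]
    selection = greedy_cover(corpus)
    reference = unit_counts(corpus)
    reduced = unit_counts(corpus, selection)
    print(selection, kl_divergence(reference, reduced))
```

In this sketch the covering and the divergence are computed separately, which makes it easy to see the trade-off the abstract reports: a small covering can be far from the reference distribution, and a divergence-driven selection would have to keep adding sentences to get closer to it.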
Topics |
Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval, Tools, systems, applications |
Full paper |
Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora |
Bibtex |
@InProceedings{BARBOT12.381,
  author = {Nelly Barbot and Olivier Boeffard and Arnaud Delhay},
  title = {Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |