Title

Combining statistics on n-grams for automatic term recognition

Authors

Almudena Ballester (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain)

Ángel Martín Municio (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain; Real Academia de Ciencias c/ Valverde, 22-24. 28004 Madrid Spain)

Fernando Pardos (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain)

Jordi Porta Zamorano (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain)

Rafael J. Ruiz Ureña (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain)

Fernando Sánchez León (Departamento de Lingüística Computacional Real Academia Española c/ Felipe IV, 4. 28071 Madrid Spain)

Session

WO11: Specialised Written Corpora

Abstract

This paper presents work in progress on an automatic term recognition (ATR) system built around the Corpus Científico-Técnico (CCT). Terms are modeled along three uncorrelated dimensions (unithood, domainhood and usage), each computed over a set of n-grams automatically extracted from the corpus. A supervised machine learning algorithm combines these dimensions to classify n-grams as terms or non-terms. Results for both noise and silence are promising given the small amount of training data. Moreover, error analysis of the noise suggests that additional information dimensions could reduce it significantly.
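
The pipeline outlined in the abstract (extract n-grams, score each on three dimensions, combine the scores with a supervised classifier) can be illustrated with a minimal Python sketch. The abstract does not name the statistics or the learning algorithm, so everything concrete here is an assumption made for illustration: pointwise mutual information stands in for unithood, a smoothed log ratio of domain to general-corpus relative frequency stands in for domainhood, document frequency stands in for usage, and a nearest-centroid rule stands in for the classifier. The function names `ngrams`, `features`, `train` and `classify` are hypothetical, and the sketch handles bigrams only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(bigram, domain_docs, general_tokens):
    """Map a bigram to an illustrative (unithood, domainhood, usage) vector.

    All three measures are stand-ins chosen for this sketch; they are not
    taken from the paper.
    """
    domain_tokens = [t for doc in domain_docs for t in doc]
    uni = Counter(domain_tokens)
    bi = Counter(g for doc in domain_docs for g in ngrams(doc, 2))
    n_tokens = len(domain_tokens)
    n_bigrams = sum(bi.values())

    # Unithood stand-in: add-one smoothed pointwise mutual information.
    p_xy = (bi[bigram] + 1) / (n_bigrams + 1)
    p_x = (uni[bigram[0]] + 1) / (n_tokens + 1)
    p_y = (uni[bigram[1]] + 1) / (n_tokens + 1)
    unithood = math.log(p_xy / (p_x * p_y))

    # Domainhood stand-in: smoothed log ratio of the bigram's relative
    # frequency in the domain corpus to that in a general corpus.
    gen_bi = Counter(ngrams(general_tokens, 2))
    n_gen = sum(gen_bi.values())
    p_general = (gen_bi[bigram] + 1) / (n_gen + 1)
    domainhood = math.log(p_xy / p_general)

    # Usage stand-in: share of domain documents that contain the bigram.
    usage = sum(bigram in ngrams(doc, 2) for doc in domain_docs) / len(domain_docs)
    return (unithood, domainhood, usage)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(col) / len(col) for col in zip(*vectors))


def train(labelled):
    """Fit a nearest-centroid model from (feature_vector, is_term) pairs."""
    terms = [v for v, is_term in labelled if is_term]
    others = [v for v, is_term in labelled if not is_term]
    return centroid(terms), centroid(others)


def classify(model, vector):
    """Return True when the vector lies closer to the term centroid."""
    term_centroid, other_centroid = model
    return math.dist(vector, term_centroid) < math.dist(vector, other_centroid)
```

In use, one would compute `features` for each manually labelled n-gram, pass the resulting pairs to `train`, and call `classify` on the feature vectors of unlabelled candidates. A nearest-centroid rule is used only because it fits in a few lines of standard-library code; any supervised learner that accepts a three-dimensional feature vector could take its place.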

Keywords

Recognition

Full Paper

284.pdf