In this paper we present a new tool for evaluating the lexical saturation of text corpora, where lexical saturation refers to a state in which it is hard to find new lexemes outside the corpus. Estimating the degree of saturation of a given corpus contributes naturally to the evaluation of corpus quality. We propose saturation tests as a stopping criterion for subcorpora creation. Although the first application of the TSCC tool is the evaluation of lexical coverage of corpora, it may be equally useful for studying the representativeness of corpora for various phenomena and, more generally, their usefulness for corpus-based research, both theoretical and practical (e.g. studies of information impact). It may serve for cost evaluation of expensive engineering tasks in language competence modelling for AI purposes, as well as in literary research. The system (TSCC) is highly language independent, i.e. it may be applied directly or easily adapted to any language whose text units can be represented in an alphabetic script. Its preliminary version (OCASSC) has been tested on a corpus of clients’ opinions published by booking.com. The prototype will be freely distributed for beta testing.
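The idea of a saturation test used as a stopping criterion can be illustrated with a minimal Python sketch. The abstract does not say how TSCC computes saturation, so everything here is an assumption made for illustration: lexemes are approximated by lowercased alphabetic word forms, saturation is approximated by the share of previously unseen word forms in each newly added text, and the names `tokenize`, `build_saturated_subcorpus`, `window`, and `threshold` are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercased alphabetic word forms stand in for lexemes.
    return re.findall(r"[^\W\d_]+", text.lower())


def build_saturated_subcorpus(
    texts: Iterable[str], window: int = 3, threshold: float = 0.05
) -> Tuple[List[str], Set[str]]:
    """Add texts one by one until few new word forms appear.

    Stops once the mean share of unseen word forms over the last
    `window` non-empty texts is at or below `threshold`.
    Illustrative criterion only, not the TSCC algorithm.
    """
    vocabulary: Set[str] = set()
    selected: List[str] = []
    recent_rates: List[float] = []

    for text in texts:
        types = set(tokenize(text))
        if not types:
            continue  # skip texts with no alphabetic tokens

        # Share of this text's word forms that the subcorpus has not seen yet.
        new_rate = len(types - vocabulary) / len(types)
        vocabulary |= types
        selected.append(text)

        recent_rates.append(new_rate)
        if len(recent_rates) > window:
            recent_rates.pop(0)

        # Saturation test used as the stopping criterion.
        if len(recent_rates) == window and sum(recent_rates) / window <= threshold:
            break

    return selected, vocabulary
```

Under these assumptions, a larger `window` or a smaller `threshold` makes the test stricter, so more texts are collected before the subcorpus is treated as saturated.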
@InProceedings{VETULANI18.6,
  author    = {Zygmunt Vetulani and Marta Witkowska},
  title     = {TSCC: a New Tool to Create Lexically Saturated Text Subcorpora},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Jana Diesner and Georg Rehm and Andreas Witt},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-05-4},
  language  = {english}
}