The process of creating large text corpora for different languages, genres, and purposes from data available on the Web involves many different tools, configurations, and, sometimes, complex distributed hardware setups. This results in increasingly complex processes with a variety of potential configurations and error sources for each tool involved. In commercial management, Business Process Management (BPM) is used successfully to cope with similarly complex workflows in multi-actor environments. Like enterprises, research environments face a gap between IT and other departments that needs to be bridged, and they also have to adapt quickly to new research questions. In this paper, we demonstrate the usefulness of applying these proven strategies and tools to the field of linguistic resource creation and management. For this purpose, an established workflow for the creation of Web corpora was adapted and integrated into a popular BPM tool, and the immediate benefits for fault detection, quality management, and support of distinct roles in the generation process are explained.
@InProceedings{KURAS18.11,
  author    = {Christoph Kuras and Thomas Eckart and Uwe Quasthoff and Dirk Goldhahn},
  title     = {Automation, Management and Improvement of Text Corpus Production},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {may},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Piotr Banski and Marc Kupietz and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Andreas Witt},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-14-6},
  language  = {english}
}