Title |
Disclose Models, Hide the Data - How to Make Use of Confidential Corpora without Seeing Sensitive Raw Data |
Authors |
Erik Faessler, Johannes Hellrich and Udo Hahn |
Abstract |
Confidential corpora from the medical, enterprise, security or intelligence domains often contain sensitive raw data which lead to severe restrictions as far as the public accessibility and distribution of such language resources are concerned. The enforcement of strict mechanisms of data protection consitutes a serious barrier for progress in language technology (products) in such domains, since these data are extremely rare or even unavailable for scientists and developers not directly involved in the creation and maintenance of such resources. In order to by-pass this problem, we here propose to distribute trained language models which were derived from such resources as a substitute for the original confidential raw data which remain hidden to the outside world. As an example, we exploit the access-protected German-language medical FRAMED corpus from which we generate and distribute models for sentence splitting, tokenization and POS tagging based on software taken from OPENNLP, NLTK and JCORE, our own UIMA-based text analytics pipeline. |
Topics |
LR Infrastructures and Architectures, Corpus (Creation, Annotation, etc.) |
Full paper |
Disclose Models, Hide the Data - How to Make Use of Confidential Corpora without Seeing Sensitive Raw Data |
Bibtex |
@InProceedings{FAESSLER14.936,
author = {Erik Faessler and Johannes Hellrich and Udo Hahn}, title = {Disclose Models, Hide the Data - How to Make Use of Confidential Corpora without Seeing Sensitive Raw Data}, booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)}, year = {2014}, month = {may}, date = {26-31}, address = {Reykjavik, Iceland}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-8-4}, language = {english} } |