Title |
Adapting a Part-of-Speech Tagset to Non-Standard Text: the Case of STTS |
Authors |
Heike Zinsmeister, Ulrich Heid and Kathrin Beck |
Abstract |
The Stuttgart-Tübingen TagSet (STTS) is a de-facto standard for the part-of-speech tagging of German texts. Since its first publication in 1995, STTS has been used in a variety of annotation projects, some of which have adapted the tagset slightly for their specific needs. Recently, the focus of many projects has shifted from the analysis of newspaper text to that of non-standard varieties such as user-generated content, historical texts, and learner language. These text types contain linguistic phenomena that are missing from or are only suboptimally covered by STTS; in a community effort, German NLP researchers have therefore proposed additions to and modifications of the tagset that will handle these phenomena more appropriately. In addition, they have discussed alternative ways of tag assignment in terms of bipartite tags (stem, token) for historical texts and tripartite tags (lexicon, morphology, distribution) for learner texts. In this article, we report on this ongoing activity, addressing methodological issues and discussing selected phenomena and their treatment in the tagset adaptation process. |
Topics |
Standards for LRs, LR National/International Projects, Infrastructural/Policy issues |
Full paper |
Adapting a Part-of-Speech Tagset to Non-Standard Text: the Case of STTS |
Bibtex |
@InProceedings{ZINSMEISTER14.721,
  author = {Heike Zinsmeister and Ulrich Heid and Kathrin Beck},
  title = {Adapting a Part-of-Speech Tagset to Non-Standard Text: the Case of STTS},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
} |