Title

A Search Tool for Corpora with Positional Tagsets and Ambiguities

Author(s)

Adam Przepiórkowski (1); Zygmunt Krynicki (2); Łukasz Dębowski (1); Marcin Woliński (1); Daniel Janus (3); Piotr Bański (4)

(1) Polish Academy of Sciences, Institute of Computer Science, ul. Ordona 21, 01-237 Warsaw, Poland - {adamp, ldebowsk, wolinski}@ipipan.waw.pl; (2) Polish-Japanese Institute of Information Technology, ul. Koszykowa 86, 02-008 Warsaw, Poland - zygmunt.krynicki@pjwstk.edu.pl; (3) University of Warsaw, Institute of Computer Science, ul. Banacha 2, 02-097 Warsaw, Poland - nathell@bach.ipipan.waw.pl; (4) University of Warsaw, Institute of English, ul. Nowy Świat 4, 00-497 Warsaw, Poland - bansp@ipipan.waw.pl

Session

P14-W

Abstract

This article describes POLIQARP, a corpus indexing and query tool that understands positional tagsets and does not assume that word forms are annotated with unique morphosyntactic tags. POLIQARP is designed to be applicable to a variety of languages and tagsets: it works with XML-encoded texts, uses the UTF-8 encoding, and allows the tagset to be specified externally. Currently, POLIQARP is used for indexing and searching a morphosyntactically annotated corpus of Polish.

Keyword(s)

corpus, positional tagset, ambiguity, concordancer, XCES, POS, part-of-speech, CQP

Language(s)

Polish (but the tool is not language-specific)

Full Paper

275.pdf