W2 2018 Proceedings

Summary of the paper

Title	Annotation of the Corpus of the Saeima with Multilingual Standards
Authors	Roberts Darģis, Ilze Auziņa, Uldis Bojārs, Pēteris Paikens and Artūrs Znotiņš
Abstract	This paper describes a release of the corpus of the Saeima (the parliament of Latvia) as an open data resource for multidisciplinary research. The corpus consists of transcriptions of Latvian parliamentary debates from 1993 until 2017, containing 38 million tokens from 468 speakers. Simply providing unannotated corpora does not sufficiently facilitate comparative research on parliamentary debate and results mostly in monolingual research by local researchers. We propose that augmenting such corpora with extra layers according to commonly used multilingual standards would make it easier to compare and contrast multiple corpora in different languages. In this regard, we believe that the key factors that need to be added are identifiers of the entities mentioned in each utterance and morphosyntactic information for linguistic analysis. For these reasons, the provided corpus is augmented with named entity linking to the Wikidata knowledge base (provided as linked data), automated translations into English, and morphological and syntactic annotations in Universal Dependencies format.
Full paper	Annotation of the Corpus of the Saeima with Multilingual Standards
Bibtex	@InProceedings{DARĢIS18.21, author = {Roberts Darģis and Ilze Auziņa and Uldis Bojārs and Pēteris Paikens and Artūrs Znotiņš}, title = {Annotation of the Corpus of the Saeima with Multilingual Standards}, booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)}, year = {2018}, month = {may}, date = {7-12}, location = {Miyazaki, Japan}, editor = {Darja Fišer and Maria Eskevich and Franciska de Jong}, publisher = {European Language Resources Association (ELRA)}, address = {Paris, France}, isbn = {979-10-95546-02-3}, language = {english} }