LREC 2008 Proceedings

Summary of the paper

Title	The JOS Morphosyntactically Tagged Corpus of Slovene
Authors	Tomaž Erjavec and Simon Krek
Abstract	The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000-word balanced monolingual sampled corpus annotated with hand-validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, a 1-million-word partially hand-validated corpus. The two corpora have been sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding, with the MULTEXT-East-type morphosyntactic specifications and the corpora encoded according to the Text Encoding Initiative Guidelines P5. JOS resources are available as a dataset for research under the Creative Commons licence and are meant to facilitate the development of HLT for Slovene.
Language	Single language
Topics	Corpus (creation, annotation, etc.), Tagging, Standards for LRs
Full paper	The JOS Morphosyntactically Tagged Corpus of Slovene
Slides	-
Bibtex	@InProceedings{ERJAVEC08.89, author = {Tomaž Erjavec and Simon Krek}, title = {The JOS Morphosyntactically Tagged Corpus of Slovene}, booktitle = {Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)}, year = {2008}, month = {may}, date = {28-30}, address = {Marrakech, Morocco}, editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias}, publisher = {European Language Resources Association (ELRA)}, isbn = {2-9517408-4-0}, note = {http://www.lrec-conf.org/proceedings/lrec2008/}, language = {english} }