Title |
Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian |
Author(s) |
Božo Bekavac (1), Petya Osenova (2), Kiril Simov (2), Marko Tadić (1); (1) Institute of Linguistics, Faculty of Philosophy, University of Zagreb, Ivana Lucica 3, 10000 Zagreb, Croatia, bbekavac@ffzg.hr, marko.tadic@ffzg.hr; (2) BulTreeBank Project, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Acad. G. Bonchev St. 25A, 1113 Sofia, Bulgaria, petya@bultreebank.org, kivs@bultreebank.org |
Session |
P12-W |
Abstract |
This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially, we rely on parameters of similarity that are more extralinguistically oriented but methodologically cleaner, such as specific topics, a pre-defined time span, and data size. The idea of 'light' and 'hard' comparable corpora is introduced. At this stage we aim to produce a 'light' bilingual comparable corpus. An algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined. |
Keyword(s) |
comparable corpora, text alignment |
Language(s) | Bulgarian, Croatian |
Full Paper |