| Title | 
         Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian  | 
    
| Author(s) | 
         Božo Bekavac (1), Petya Osenova (2), Kiril Simov (2), Marko Tadić (1); (1) Institute of Linguistics, Faculty of Philosophy, University of Zagreb, Ivana Lucica 3, 10000 Zagreb, Croatia, bbekavac@ffzg.hr, marko.tadic@ffzg.hr; (2) BulTreeBank Project, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Acad. G. Bonchev St. 25A, 1113 Sofia, Bulgaria, petya@bultreebank.org, kivs@bultreebank.org  | 
    
| Session | 
         P12-W  | 
    
| Abstract | 
         This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially, we rely on parameters of similarity that are more extralinguistically oriented but methodologically cleaner, such as specific topics, a pre-defined time span, and data size. We introduce the notions of 'light' and 'hard' comparable corpora; at this stage we aim to produce a 'light' bilingual comparable corpus. We present the algorithm for identifying lexical similarity and aligning linguistic units, and outline the initial experiments.  | 
    
| Keyword(s) | 
         comparable corpora, text alignment  | 
    
| Language(s) | Bulgarian, Croatian | 