Title

SAM: System for Multi-criteria Text Alignment

Authors

Hatem Ghorbel (Swiss Federal Institute of Technology, EPFL, Faculté Informatique et Communications, Computer Science Theory Laboratory, LITH, IN Ecublens, 1015 Lausanne, Switzerland)

Giovanni Coray (Swiss Federal Institute of Technology, EPFL, Faculté Informatique et Communications, Computer Science Theory Laboratory, LITH, IN Ecublens, 1015 Lausanne, Switzerland)

André Linden (Swiss Federal Institute of Technology, EPFL, Faculté Informatique et Communications, Computer Science Theory Laboratory, LITH, IN Ecublens, 1015 Lausanne, Switzerland)

Session

WP1: Corpora & Corpus Tools

Abstract

The problem of text alignment is to establish the correspondence between subparts of two or more translations or versions of the same document. Most alignment methods are based on the statistical analysis of word or character frequencies or of string occurrences. To achieve more accurate results, other methods have incorporated structural properties of the documents as further criteria.

Aligning different versions of medieval texts, namely prose and verse versions, requires more efficient methods of content comparison. In this article, we propose an extension to existing alignment methods that considers further linguistic and structural properties of the texts. As a linguistic criterion, we propose heuristics to calculate similarities at the lexical, morphological, syntactic and semantic levels of the texts. As a structural criterion, we extend the similarity measures to take into account different properties of the rhetorical structure of the texts. The alignment process is therefore an optimization problem that maximizes linguistic and structural similarities between aligned pairs of parallel versions.
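The optimization view in the preceding paragraph can be illustrated with a minimal sketch. It is not the SAM system itself: the abstract does not specify the algorithm, so everything here is an assumption made for illustration. The linguistic criterion is reduced to a word-set (Jaccard) overlap, the structural criterion to closeness of relative position in the document, and the maximization is done with a standard dynamic-programming alignment that allows unaligned segments. The function names, the weights `w_ling` and `w_struct`, and the `gap` penalty are all hypothetical.

```python
from typing import List, Sequence, Tuple


def lexical_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumed stand-in for the linguistic criterion; the paper's own
    heuristics (morphological, syntactic, semantic) are not modeled.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def structural_similarity(i: int, j: int, n: int, m: int) -> float:
    """Closeness of relative positions in the two documents, in [0, 1].

    Assumed stand-in for the structural criterion; rhetorical structure
    is not modeled.
    """
    return 1.0 - abs(i / max(n - 1, 1) - j / max(m - 1, 1))


def align(
    source: Sequence[str],
    target: Sequence[str],
    w_ling: float = 0.7,    # hypothetical weight of the linguistic score
    w_struct: float = 0.3,  # hypothetical weight of the structural score
    gap: float = 0.1,       # hypothetical penalty for an unaligned segment
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) that maximize the total combined similarity."""
    n, m = len(source), len(target)
    # score[i][j]: best total for aligning source[:i] with target[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap
        back[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap
        back[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = (
                w_ling * lexical_similarity(source[i - 1], target[j - 1])
                + w_struct * structural_similarity(i - 1, j - 1, n, m)
            )
            candidates = [
                (score[i - 1][j - 1] + sim, "diag"),  # align the two segments
                (score[i - 1][j] - gap, "up"),        # leave source segment unaligned
                (score[i][j - 1] - gap, "left"),      # leave target segment unaligned
            ]
            score[i][j], back[i][j] = max(candidates)
    # Follow the back-pointers from the end to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Calling `align(prose_segments, verse_segments)` on two lists of text segments returns the index pairs judged to correspond. In this sketch, richer linguistic or rhetorical measures would replace the two similarity functions without changing the optimization step.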

Keywords

Multi-Criteria text alignment

Full Paper

66.pdf