Summary of the paper

Title Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Authors Quang Thắng Đinh, Hồng Phương Lê, Thi Minh Huyền Nguyễn, Cẩm Tú Nguyễn, Mathias Rossignol and Xuẩn Lương Vũ
Abstract This paper compares three word segmentation systems for Vietnamese. Most Vietnamese words are built by semantic composition from about 7,000 syllables, which also carry meaning as isolated words, so identifying word boundaries in a text is not a simple task and ambiguities often appear. Beyond presenting the tested systems, we also propose a standard definition for word segmentation in Vietnamese and introduce a reference corpus developed for evaluating this task. The results confirm that word segmentation can be handled relatively well by automatic means, although a solution is still needed for out-of-vocabulary words.
Language
Topics Corpus (creation, annotation, etc.), Other
Full paper Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Slides -
Bibtex @InProceedings{INH08.493,
  author = {Quang Thắng Đinh and Hồng Phương Lê and Thi Minh Huyền Nguyễn and Cẩm Tú Nguyễn and Mathias Rossignol and Xuẩn Lương Vũ},
  title = {Word Segmentation of Vietnamese Texts: a Comparison of Approaches},
  booktitle = {Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)},
  year = {2008},
  month = {may},
  date = {28-30},
  address = {Marrakech, Morocco},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-4-0},
  note = {http://www.lrec-conf.org/proceedings/lrec2008/},
  language = {english}
  }
