Summary of the paper

Title Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Authors Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan and Ann Sawyer
Abstract The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.
Topics LR Infrastructures and Architectures, Machine Translation, Speech-to-Speech Translation
Full paper Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Bibtex @InProceedings{SONG14.1094,
  author = {Zhiyi Song and Stephanie Strassel and Haejoong Lee and Kevin Walker and Jonathan Wright and Jennifer Garland and Dana Fore and Brian Gainor and Preston Cabe and Thomas Thomas and Brendan Callahan and Ann Sawyer},
  title = {Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
 }