Summary of the paper

Title A Corpus Factory for Many Languages
Authors Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS
Abstract For many languages there are no large, general-language corpora available. Until the web, all but the institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of 'seed words' for the language from Wikipedia. Then, several hundred times over, we * randomly select three or four of the seed words * send as a query to Google or Yahoo or Bing, which returns a 'search hits' page * gather the pages that Google or Yahoo point to and save the text. This forms the corpus, which we then * 'clean' (to remove navigation bars, advertisements etc) * remove duplicates * tokenise and (if tools are available) lemmatise and part-of-speech tag * load into our corpus query tool, the Sketch Engine The corpora we have developed are available for use in the Sketch Engine corpus query tool.
Topics Corpus (creation, annotation, etc.), Acquisition, LR Infrastructures and Architectures
Full paper A Corpus Factory for Many Languages
Slides -
Bibtex @InProceedings{KILGARRIFF10.79,
  author = {Adam Kilgarriff and Siva Reddy and Jan Pomikálek and Avinesh PVS},
  title = {A Corpus Factory for Many Languages},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
 }
Powered by ELDA © 2010 ELDA/ELRA