LREC 2018 Proceedings

Summary of the paper

Title	BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
Authors	Benjamin Heinzerling and Michael Strube
Abstract	We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as a testbed, BPEmb performs competitively with alternative subword approaches, and for some languages better, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb.
Topics	Morphology, Multilinguality, Semantics
Full paper	BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
Bibtex	@InProceedings{HEINZERLING18.1049,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
}