Title

Multilingual XML-Based Named Entity Recognition for E-Retail Domains

Authors

Claire Grover (Language Technology Group, University of Edinburgh)

Scott McDonald (Language Technology Group, University of Edinburgh)

Donnla Nic Gearailt (Language Technology Group, University of Edinburgh)

Vangelis Karkaletsis (Institute for Informatics and Telecommunications, National Centre for Scientific Research "Demokritos")

Dimitra Farmakiotou (Institute for Informatics and Telecommunications, National Centre for Scientific Research "Demokritos")

Georgios Samaritakis (Institute for Informatics and Telecommunications, National Centre for Scientific Research "Demokritos")

Georgios Petasis (Institute for Informatics and Telecommunications, National Centre for Scientific Research "Demokritos")

Maria Teresa Pazienza (D.I.S.P., Università di Roma Tor Vergata)

Michele Vindigni (D.I.S.P., Università di Roma Tor Vergata)

Frantz Vichot (Informatique-CDC, Groupe Caisse des Dépôts)

Francis Wolinski (Informatique-CDC, Groupe Caisse des Dépôts)

Session

WP3: Tools & Components

Abstract

We describe the multilingual Named Entity Recognition and Classification (NERC) component of an e-retail product comparison system that is currently under development as part of the EU-funded project CROSSMARC. The system must be rapidly extensible, both to new languages and to new domains. To achieve this, we use XML as our common exchange format, and the monolingual NERC components use a combination of rule-based and machine-learning techniques. Processing web pages has been challenging because they contain heavily structured data in which text is intermingled with HTML and other code. Our preliminary evaluation results demonstrate the viability of our approach.

Keywords

Recognition

Full Paper

233.pdf