Title |
Extracting News Web Page Creation Time with DCTFinder |
Authors |
Xavier Tannier |
Abstract |
Web pages do not offer reliable metadata concerning their creation date and time. However, getting the document creation time is a necessary step for allowing to apply temporal normalization systems to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Using such a system allows further deep and efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87 and 92\%). DCTFinder is made freely available on http://sourceforge.net/projects/dctfinder/, as well as all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself. |
Topics |
|
Full paper |
Extracting News Web Page Creation Time with DCTFinder |
Bibtex |
@InProceedings{TANNIER14.3,
author = {Xavier Tannier}, title = {Extracting News Web Page Creation Time with DCTFinder}, booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)}, year = {2014}, month = {may}, date = {26-31}, address = {Reykjavik, Iceland}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-8-4}, language = {english} } |