From PDF to structured references: A comparative study on tools for bibliographic references extraction and parsing

The aim of this work is to identify all, and only, the tools that, given a full-text paper in PDF format, can identify, extract and parse bibliographic references. The methods on which the tools are based do not influence their selection. The first phase of this thesis is a literature review, which identifies seven tools: Anystyle, Cermine, ExCite, GROBID, Pdfssa4met, Scholarcy and Science Parse. In the second phase, these tools are compared and evaluated across different research fields. Anystyle obtains the best overall score, followed by Cermine. However, in some of the subtasks investigated alongside the overall results, other tools perform better. Given this varied picture, different solutions can be adopted on the basis of the user's requirements.

Related Organizations

Alma Mater Studiorum University of Bologna
Italy

Keywords

Machine learning, Bibliographic references extraction, Structured citation data, References parsing

Impact by BIP!

Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.