bioRxiv 10k

This dataset is a CC-BY 4.0 subset of what bioRxiv kindly made available at https://www.biorxiv.org/tdm. It contains 10,000 PDF / XML pairs in total, randomized and split into train (6,000), validation (2,000) and test (2,000) subsets. The zip files also contain file lists for smaller subsets, which were selected by subject area to potentially create a balanced subset. The zip is similar in structure to the "PMC sample 1943" dataset that was created as part of https://doi.org/10.1145/2494266.2494271 (a working link is available from https://grobid.readthedocs.io/en/stable/End-to-end-evaluation/). It is therefore well suited to evaluating PDF to XML conversion tools, such as GROBID. The dataset was created as part of eLife's ScienceBeam project.
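For readers who want to build a comparable split from their own list of PDF / XML pairs, the following is a minimal sketch of a seeded random split into the three subset sizes given above. It is not the procedure used to create this dataset, which is not described here; the function name `split_pairs`, the `seed` parameter, and the `doc00000`-style identifiers are all hypothetical.

```python
import random


def split_pairs(ids, sizes=(("train", 6000), ("validation", 2000), ("test", 2000)), seed=0):
    """Shuffle ids with a fixed seed and cut them into named, disjoint subsets.

    Assumption: each id stands for one PDF / XML pair. The subset sizes
    mirror the ones stated in the description; the seed is arbitrary.
    """
    total = sum(count for _, count in sizes)
    if len(ids) != total:
        raise ValueError(f"expected {total} ids, got {len(ids)}")

    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)

    splits, start = {}, 0
    for name, count in sizes:
        splits[name] = shuffled[start:start + count]
        start += count
    return splits


# Hypothetical identifiers standing in for the 10,000 document pairs.
ids = [f"doc{i:05d}" for i in range(10_000)]
splits = split_pairs(ids)
print({name: len(members) for name, members in splits.items()})
# → {'train': 6000, 'validation': 2000, 'test': 2000}
```

Fixing the seed keeps the split reproducible, so results from different conversion tools can be compared on the same test subset.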

References

Constantin, A., Steve, P., Andrei, V.: Fully-automated PDF-to-XML conversion of scientific literature. In: Proceedings of the ACM Symposium on Document Engineering, pp. 177–180. ACM, New York (2013). doi: 10.1145/2494266.2494271

Keywords

bioRxiv, XML, JATS, PDF

