Cleaning different types of DOI errors found in cited references on Crossref using automated methods

References

Boente, R., Massari, A., Santini, C., & Tural, D. (2021a). Classes of errors in DOI names (Data Management Plan) (Version 5). Zenodo. https://doi.org/10.5281/zenodo.4733919
Boente, R., Massari, A., Santini, C., & Tural, D. (2021b). Protocol: Investigating DOIs classes of errors. protocols.io. https://dx.doi.org/10.17504/protocols.io.buuknwuw
Boente, R., Massari, A., Santini, C., & Tural, D. (2021). Classes of errors in DOI names: output dataset (Version v1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4892551
Bostock, M. (2021). D3: Data-Driven Documents. Software Heritage. https://archive.softwareheritage.org/swh:1:dir:35fe697ae5a21e96d9fc01d890b30010e23c16dd
Buchanan, R. A. (2006). Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4), 292–303. https://doi.org/10.5860/crl.67.4.292
Cioffi, A., Coppini, S., Moretti, A., & Shahidzadeh, A. N. (2021, May 3). Investigating missing citations in COCI and publishers involved (Version First). Zenodo. http://doi.org/10.5281/zenodo.4735636
Crossref. (2021). January 2021 Public Data File from Crossref. https://doi.org/10.13003/GU3DQMJVG4
Domanskyi, S., Szedlak, A., Hawkins, N. T., Wang, J., Paternostro, G., & Piermarocchi, C. (2019). bioRxiv 539833. https://doi.org/10.1101/539833
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases. Scientometrics, 102(3), 2181–2186. https://doi.org/10.1007/s11192-014-1503-4
García-Alonso, C. R., Pérez-Naranjo, L. M., & Fernández-Caballero, J. C. (2014). Multiobjective evolutionary algorithms to identify highly autocorrelated areas: the case of spatial distribution in financially compromised farms. Ann Oper Res, 219, 187–202. https://doi.org/10.1007/s10479-011-0841-3
Heibi, I., Peroni, S., & Shotton, D. (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121(2), 1213–1228. https://doi.org/10.1007/s11192-019-03217-6
International DOI Foundation. (2019). DOI® Handbook. https://doi.org/10.1000/182
Krebs, S. L. (2018). Rhododendron. In: Van Huylenbroeck J. (eds) Ornamental Crops. Handbook of Plant Breeding, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-90698-0_26
Massari, A., Santini, C., & Boente, R. (2021). open-sci/2020-2021-grasshoppers-code: Classes of errors in DOI names (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.4723983
Peroni, S. (2021). Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4625300
Wang, S., Van Huylenbroeck, J., & Zhang, L.-H. (2020). Adaptability of Rhododendron species to climate and growth conditions at Lushan Botanical Garden. Acta Hortic., 1288, 131–138. https://doi.org/10.17660/ActaHortic.2020.1288.20
Xu, S., Hao, L., An, X., Zhai, D., & Pang, H. (2019). Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics, 120(3), 1427–1437. https://doi.org/10.1007/s11192-019-03162-4
Zhu, J., Hu, G., & Liu, W. (2019). DOI errors and possible solutions for Web of Science. Scientometrics, 118, 709–718. https://doi.org/10.1007/s11192-018-2980-7

Abstract

Purpose: This work seeks an automated process for repairing the invalid DOI names that Silvio Peroni collected while processing data provided by Crossref (2021).

Design / methodology / approach: The data for this research is a CSV list of more than 1 million invalid cited DOI names. To arrive at an automated process, the errors that characterize the invalid DOI names in the list are first classified. DOI names that have become valid in the meantime are removed, so that the analysis concentrates exclusively on factual errors such as additional or invalid characters. These factual errors are then classified as prefix-, suffix- or other-type errors. Through closer investigation and extension of existing research in this field, the study identifies regular expressions that can clean the different types of invalid DOI names, for example by deleting additional strings at the beginning or the end. After the cleanup, the cleaned DOI names are checked for validity again.

Findings: The study found automated processes, based on regular expressions, that correct factual errors belonging to different subclasses. After the proposed algorithm was applied to the dataset, around 16% of the DOI names proved valid. The largest share of these valid DOIs was made valid by cleaning up suffix errors; however, many DOIs also proved valid without cleaning, having been only temporarily invalid.

Research limitations / implications: Checking whether DOI names are valid consumes either a lot of time or a large amount of RAM, and the check has to be run both before and after the cleaning. The described methods are therefore applicable only to smaller datasets unless the necessary resources are available. In addition, some DOI names will always remain that cannot be made valid by automated processes. In these cases, it is important to find the publishers responsible for the incorrect references, which is done in a separate, related project (Cioffi et al., 2021).

Originality / value: Building on existing research, this study extends and improves regular expressions designed to clean DOI errors, in order to enhance data quality in the COCI dataset. Because the COCI project provides open access to the reference lists of scientific works, the whole academic community can profit from this improvement in data quality. In addition, the methods presented could serve as a basis for further research in this field, allowing DOI name errors to be corrected in other datasets as well.
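
The check, clean, and re-check workflow described in the abstract might be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expressions, the function names `clean_doi` and `repair`, and the `is_valid` callable are hypothetical and are not the patterns or code used in the study. In practice `is_valid` would wrap some lookup against a DOI resolver; here it is passed in so the sketch runs offline.

```python
import re

# Hypothetical patterns for illustration only; not the study's regular expressions.
# Prefix errors: a resolver URL or a "doi:" label glued to the front of the name.
PREFIX_RE = re.compile(
    r"^(?:\s*(?:https?://)?(?:dx\.)?doi\.org/|\s*doi:?\s*)+", re.IGNORECASE
)
# Suffix errors: trailing punctuation, whitespace, or a leftover URL path segment.
SUFFIX_RE = re.compile(r"(?:[.,;:\s]|/(?:abstract|full|pdf|epdf))+$", re.IGNORECASE)
# Rough shape of a DOI name: "10.", a registrant code, "/", and a suffix.
DOI_SHAPE_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi(raw):
    """Strip prefix and suffix noise; return (cleaned, error_classes)."""
    classes = []
    cleaned = raw
    stripped = PREFIX_RE.sub("", cleaned)
    if stripped != cleaned:
        classes.append("prefix")
        cleaned = stripped
    stripped = SUFFIX_RE.sub("", cleaned)
    if stripped != cleaned:
        classes.append("suffix")
        cleaned = stripped
    if not DOI_SHAPE_RE.match(cleaned):
        # Still not DOI-shaped after cleaning: some other kind of error.
        classes.append("other")
    return cleaned, classes


def repair(raw, is_valid):
    """Check validity before and after cleaning, as the abstract outlines.

    `is_valid` is an assumed callable taking a DOI name and returning a bool.
    """
    if is_valid(raw):
        return raw, "already valid"  # e.g. only temporarily invalid
    cleaned, _ = clean_doi(raw)
    if cleaned != raw and is_valid(cleaned):
        return cleaned, "repaired"
    return raw, "unrepaired"


if __name__ == "__main__":
    known = {"10.1000/182"}  # stand-in for a real validity lookup
    for candidate in ["10.1000/182", "doi:10.1000/182.", "not-a-doi"]:
        print(candidate, "->", repair(candidate, known.__contains__))
```

Passing the validity check in as a parameter keeps the costly step, which the abstract notes must run both before and after cleaning, separate from the cheap regular-expression step, so the two can be batched or cached independently.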

Keywords

Crossref, invalid DOIs, open citations, OpenCitations, COCI
