
arXiv: 2010.03083
Abstract: We present a publicly available data set of the history of all references (more than 55 million) ever used in the English Wikipedia until June 2019. We have applied a new method for identifying and monitoring references in Wikipedia, so that for each reference we can provide data about associated actions: creation, modifications, deletions, and reinsertions. The high accuracy of this method and of the resulting data set was confirmed via a comprehensive crowdworker labeling campaign. We use the data set to study the temporal evolution of Wikipedia references as well as users’ editing behavior. We find evidence of a mostly productive and continuous effort to improve the quality of references: there is a persistent increase in reference and document identifiers (DOI, PubMedID, PMC, ISBN, ISSN, ArXiv ID), and most of the reference curation work is done by registered humans (not bots or anonymous editors). We conclude that the evolution of Wikipedia references, including the dynamics of the community processes that tend to them, should be leveraged in the design of relevance indexes for altmetrics, and that our data set can be pivotal for such an effort.
FOS: Computer and information sciences, Science (General), info:eu-repo/classification/ddc/320, Monitoring, data set, Information and Documentation, Libraries, Archives, Q1-390, Computer Science - Computers and Society, Interactive, electronic Media, altmetrics, edit histories, Wikipedia editors, Wikipedia references, Scientometrics, Bibliometrics, Informetrics, Computers and Society (cs.CY), data quality, Digital Libraries (cs.DL), News media, journalism, publishing, 10800, Computer Science - Digital Libraries, source of information, data, Wikipedia, ddc:070
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator. | 5 |
| Popularity | The "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| Influence | The overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | The initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
