Text mining and analysis of "data available upon request" statements in scientific articles. (Code)

Authors: Ballester, Benoit

Abstract

This repository (automatically synced from GitHub) contains the code and data tables used to analyse request-based data availability statements, such as “data available upon request”, in the genomics, genetics, and bioinformatics literature between 2010 and 2025. The analyses focus on how often such wording is used, how it has changed over time, and how it relates to other signals of open science support at the article and journal level.

Overview

Using full-text XML articles from PubMed Central Open Access, we detect and classify “upon request” statements and distinguish vague formulations from cases linked to explicit access mechanisms or legitimate restrictions. In parallel, we extract multiple indicators of open science practices, including data deposition, code availability, protocol sharing, and source data provision.

The repository accompanies a policy and meta-research analysis and is intended to support transparency, reproducibility, and independent auditing.

General content

- Scripts for metadata retrieval, XML parsing, and text mining
- Rule-based classification of request-based availability statements
- Extraction and scoring of open science support indicators
- Small derived tables and reference files stored in 2.data/

Large datasets and full-text XML corpora are not hosted on GitHub.

Folder contents

- 0.config/: Configuration files, including the conda environment file.
- 1.scripts/: Main analysis pipeline scripts, organised by step:
  - 00_setup/
  - 01_metadata/
  - 02_filter/
  - 03_download/
  - 04_qc/
  - 05_upon_request/
  - 06_open_science/
  - 07_plots/ (notebooks)
- 2.data/: Data tables used by the pipeline. Note: this directory is empty in Zenodo.
- 3.xml/: Placeholder for full-text JATS XML files (not hosted on GitHub; archived on Zenodo). Includes a short README.txt.
- 3.no_cc_code/: Notes about code or components not redistributed. Includes a short README.txt.
- 4.analyses/: Precomputed analysis outputs and figures (PDFs) used for the manuscript (for example, journal-level plots and OSSI summaries), plus supporting subfolders.

Notebooks

- under_request_overview.ipynb: Summary analysis of the prevalence, classification, and temporal trends of “data available upon request” statements across journals and years.
- open_science.ipynb: Analysis of open science support indicators (OSSI).

Data availability

Full-text XML files and large derived datasets used for the analyses are archived separately on Zenodo. Links to the corresponding Zenodo records will be provided.

Usage

To reproduce the analysis:

1. Clone the repository.
2. Install dependencies via conda using the provided environment.yml.
3. Place downloaded XML files and large datasets in the appropriate directories, as documented.
4. Run the parsing and classification scripts in order.
5. Open the notebooks in 07_plots/ to regenerate figures.

Status

This repository reflects the analysis pipeline used for the associated manuscript. Minor updates and documentation improvements may occur, but the overall structure is stable.

License

This project is released under the GNU General Public License v3.0. See the LICENSE file for details.

Citation

If you use this code or derived analyses in academic work, please cite the associated manuscript. A CITATION.cff file will be added.

If you use the XML files in academic work, please cite the associated manuscript and the Zenodo records below (code, XML corpus, and derived tables).

- Manuscript: Ballester, B. (2026). *From ‘data available upon request’ to accountable data access in genomics*. DOI: to be added.
- Code (Zenodo): Ballester, B. (2026). *Code: Text mining and analysis of “data available upon request” statements in scientific articles* (v1). Zenodo. https://doi.org/10.5281/zenodo.18339878
- Full-text XML corpus (Zenodo): Ballester, B. (2026). *XML: PubMed Central Open Access Subset JATS XML files used for “data available upon request” analyses* (2010–2025). Zenodo. https://doi.org/10.5281/zenodo.18377386
- Derived tables (Zenodo, 2.data record): Ballester, B. (2026). *Data: Text mining and analysis of “data available upon request” statements in scientific articles*. Zenodo. https://doi.org/10.5281/zenodo.18375259
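
Illustrative sketch

To make the rule-based detection described in the Overview more concrete, here is a minimal Python sketch. It is not the repository's actual implementation, which presumably lives under 1.scripts/05_upon_request/ and is not reproduced in this record. The regular expressions, the three category labels, and the function names `extract_availability_text` and `classify_statement` are hypothetical choices made for this example. It also assumes that availability statements are tagged with `sec-type="data-availability"` or `notes-type="data-availability"`, which is common in JATS XML but not guaranteed for every article.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: "upon request" / "on request", optionally with one word
# in between (for example "upon reasonable request").
REQUEST_RE = re.compile(r"\b(?:up)?on\s+(?:\w+\s+)?request\b", re.IGNORECASE)

# Hypothetical rule: wording that hints at an explicit access mechanism or a
# legitimate restriction. A real rule set would be much larger.
MECHANISM_RE = re.compile(
    r"\b(?:data access committee|data use agreement|controlled access|"
    r"ethic\w*|privacy|consent|confidential\w*)\b",
    re.IGNORECASE,
)


def extract_availability_text(jats_xml):
    """Return whitespace-normalised text of elements tagged as data availability."""
    root = ET.fromstring(jats_xml)
    texts = []
    for elem in root.iter():
        kind = elem.get("sec-type") or elem.get("notes-type") or ""
        if "data-availability" in kind:
            # itertext() yields all nested text; split/join collapses whitespace.
            texts.append(" ".join("".join(elem.itertext()).split()))
    return texts


def classify_statement(text):
    """Assign one of three illustrative labels to an availability statement."""
    if not REQUEST_RE.search(text):
        return "no_request"
    if MECHANISM_RE.search(text):
        return "request_with_mechanism_or_restriction"
    return "vague_request"


# Made-up example article fragment, for demonstration only.
SAMPLE = """<article>
  <back>
    <sec sec-type="data-availability">
      <title>Data availability</title>
      <p>The datasets are available from the corresponding author upon reasonable request.</p>
    </sec>
  </back>
</article>"""

if __name__ == "__main__":
    for statement in extract_availability_text(SAMPLE):
        print(classify_statement(statement), "|", statement)
```

Separating extraction from classification keeps the rules testable on plain strings, independently of how a given journal marks up its availability section.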
