
This repository contains the data, scripts, and results for the Impact of Open Access Routes on Topic Persistence case study, part of the PATHOS project.

Overview

Artificial intelligence methods are being rapidly mobilized to tackle the climate crisis, but the knowledge base often burns bright and fades quickly. This case study asks whether two distinct Open Access (OA) routes help AI-for-Climate research topics stay active in the literature:

- Green OA: self-archiving in repositories
- Published OA: journal-mediated open access with a clear licence

Bronze OA and dual-mode publications are excluded for treatment clarity. Closed Access (CA) articles serve as the counterfactual.

By foregrounding topic persistence as a key dimension of impact, the study goes beyond short-term citation counts and investigates whether openness helps research topics remain visible long enough to demonstrate their potential.

Repository Structure

    ├── README.md
    ├── fos_taxonomy_v0.1.2.json
    ├── persistent_topics_create_collection.py
    ├── persistent_topics_find_paper_openaireids.py
    ├── persistent_topics_find_paper_affiliations.py
    ├── persistent_topics_get_collection_author_gender.py
    ├── persistent_topics_calculate_indicators.py
    ├── persistent_topics_calculate_indicators_sdg.py
    ├── persistent_topics_indicators_create_data_for_vis.py
    └── persistent_topics_collection_w_outcomes/
        ├── complete_collection_df.parquet / .xlsx
        ├── topic_attribution_df.parquet / .xlsx
        ├── results/
        │   ├── analysis_conclusions.txt
        │   ├── summary_statistics.xlsx
        │   ├── treatment_effects_green_oa.xlsx
        │   ├── treatment_effects_published_oa.xlsx
        │   ├── descriptive_effects_any_oa.xlsx
        │   ├── tables/
        │   │   ├── 01_executive_summary.xlsx
        │   │   ├── 02_treatment_group_characteristics.xlsx
        │   │   ├── 03_causal_effects_summary.xlsx
        │   │   ├── 04_topic_persistence_analysis.xlsx
        │   │   ├── 05_gender_equity_outcomes.xlsx
        │   │   ├── 06_economic_impact_analysis.xlsx
        │   │   ├── 07_publication_year_analysis.xlsx
        │   │   └── 08_robustness_analysis.xlsx
        │   ├── visualizations/
        │   │   ├── 01_sample_overview.png
        │   │   ├── 02_causal_effects.png
        │   │   ├── 03_outcome_analysis.png
        │   │   └── 04_temporal_and_balance.png
        │   └── final_visualization_data_figures/
        │       ├── data/
        │       └── figures/
        └── results_sdg_only/
            ├── sdg_analysis_conclusions.txt
            ├── green_matched_sdg_papers.xlsx
            ├── published_matched_sdg_papers.xlsx
            ├── closed_matched_a_sdg_papers.xlsx
            ├── closed_matched_b_sdg_papers.xlsx
            ├── tables/
            │   ├── 01_sdg_distribution_matched_samples.xlsx
            │   ├── 02_sdg_treatment_effects.xlsx
            │   ├── 03_sdg_vs_non_sdg_comparison.xlsx
            │   ├── 04_sdg_categories_by_impact.xlsx
            │   ├── 05_sdg_gender_industry_collaboration.xlsx
            │   ├── 06_sdg_analysis_summary.xlsx
            │   ├── 07_sdg_alignment_comparison_matched.xlsx
            │   └── 08_sdg_alignment_effects_summary.xlsx
            └── visualizations/
                ├── 01_sdg_distribution_overview.png
                ├── 02_sdg_treatment_effects.png
                ├── 03_sdg_impact_analysis.png
                └── 04_sdg_alignment_comparison_matched.png

Data Sources

External Data Sources (not included)

- Semantic Scholar Academic Graph: full publication metadata
- OpenAIRE Graph: European research infrastructure data
- PATSTAT: patent database for citation analysis
- ROR: Research Organization Registry
- SciNoBo toolkit: FOS classification, interdisciplinarity, SDG mapping, FWCI scores

Included Data

- Complete processed collection with outcomes
- Topic attribution dataset (paper-topic mappings, persistence scores)
- Analysis results: matched samples, treatment effects, summary statistics
- SciNoBo Field of Science taxonomy (fos_taxonomy_v0.1.2.json)

Scripts

Data Processing

- persistent_topics_create_collection.py – integrates multiple data sources, outcomes, affiliations, patent citations
- persistent_topics_find_paper_openaireids.py – maps DOIs to OpenAIRE IDs
- persistent_topics_find_paper_affiliations.py – extracts affiliations, science-industry collaboration
- persistent_topics_get_collection_author_gender.py – gender classification of authors

Analysis

- persistent_topics_calculate_indicators.py – main causal inference analysis (PSM for Green OA vs CA, Published OA vs CA)
- persistent_topics_calculate_indicators_sdg.py – SDG-focused treatment effects
- persistent_topics_indicators_create_data_for_vis.py – prepares final visualization datasets and figures

Key Findings

Sample

- Total: 132,134 papers (2000–2021)
- Green OA: 3,792 papers
- Published OA: 19,045 papers
- Closed Access: 92,998 papers

Contributions

- New Topic Persistence Metric for long-term impact
- Clean OA treatment definitions (excluding dual-mode and Bronze)
- Separate analysis of Green vs Published OA pathways

Main Results

- 8 significant causal effects across outcomes
- Enhanced topic persistence in OA papers
- Positive gender equity outcomes
- Evidence of economic impact (patents, collaborations)

SDG Findings

- 24,948 SDG-relevant papers (18.9% of sample)
- 11 significant treatment effects for SDG-related research
- Stronger knowledge sustainability for achieving SDG goals

Methodology

Design

- Propensity Score Matching (PSM) with balanced covariates
- Separate analyses for Green OA vs CA and Published OA vs CA
- Robust outcome metrics (including new persistence measure)

Treatment Definitions

- Green OA: repository-based
- Published OA: journal-based (gold, hybrid, diamond)
- Closed Access: no open provision
- Excluded: dual-mode and Bronze OA

Outcomes

- Citation impact (traditional)
- Topic persistence (novel metric)
- Gender equity in authorship
- Economic impact (patents, collaboration)
- Field effects (disciplinary and SDG)
