CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024) — PostgreSQL edition

Lemor, Antoine; Pillod, Alizée; Taylor, Matthew; Nadeau, Richard



Research data: Dataset | Under curation | English | Publisher: Zenodo


doi: 10.5281/zenodo.20667151


Abstract

The Canadian Climate Framing (CCF) Database is a comprehensive, machine-learning-annotated corpus of climate-change media coverage in Canada. It comprises 266,271 articles from 20 major Canadian newspapers (1978–2024) processed into 9,198,958 two-sentence analytical units (82.9% English, 17.1% French). Each unit is annotated across 65 hierarchical categories by 128 BERT and CamemBERT classifiers, with a macro F1 of 0.866 on a 1,000-sentence gold standard double-coded by an independent annotator (Gwet's AC1 = 0.894, Krippendorff's α = 0.698, Cohen's κ = 0.596 on the 400 blind sentences). Each category receives an A/B/C reliability tier summarising annotation quality from classifier performance and inter-coder agreement. The deposit ships six relational tables (bibliographic metadata, sentence-level annotations, named-entity rollups, article-level aggregates, per-category reliability tiers, and 9,462,845 BAAI/bge-m3 sentence-and-title embeddings). Raw newspaper text is excluded for copyright reasons; bibliographic coordinates (media, date, title, author, page_number) are sufficient for any researcher with institutional access to Factiva, Eureka.cc or ProQuest Canadian Major Dailies to recover the original sentences. This deposit accompanies a methodology paper currently under revision at Scientific Data (Nature Portfolio).

This deposit is the canonical PostgreSQL edition. It contains a pg_dump -Fd directory archive (compressed into a single .tar file) of the six relational tables, including the pgvector extension and HNSW cosine indexes for sub-second semantic-similarity search. Restoration is a one-liner:

    tar -xf CCF_Database.tar && createdb CCF_Database && psql -d CCF_Database -c 'CREATE EXTENSION IF NOT EXISTS vector;' && pg_restore -d CCF_Database --no-owner --no-privileges -j 8 CCF_Database_dump

A column-oriented Apache Parquet mirror of the same six tables is available as the sister deposit on Zenodo (cross-referenced in Related identifiers). The Parquet mirror is recommended for users without PostgreSQL access (it is directly readable by pandas, polars, R/arrow, DuckDB, and Spark).

The full annotation pipeline, training data, manual-annotation JSONL, intercoder-reliability benchmark, methodology manuscript (LaTeX sources + PDF), and reproducibility scripts are bundled with this deposit as ccf_code_and_paper.tar.gz. The same materials are also available on the project's OSF companion deposit (DOI: 10.17605/OSF.IO/Q5W47) and on the development mirror at GitHub.

Requirements: PostgreSQL 16 or 17 with pgvector ≥ 0.8.2 (for halfvec(1024) storage of the sentence embeddings).
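The restore one-liner and the pgvector version requirement described above can be sketched as a small Python helper. It runs the same four steps one at a time, so a failure is easier to locate. This is a minimal sketch, not part of the deposit: the function names (restore_commands, pgvector_ok, run_restore) are hypothetical, and the file, database, and directory names are simply the ones that appear in the restore command. It assumes tar, createdb, psql, and pg_restore are on the PATH.

```python
import shlex
import subprocess


def restore_commands(archive="CCF_Database.tar", db="CCF_Database",
                     dump_dir="CCF_Database_dump", jobs=8):
    """Return the four restore steps from the deposit description as argv lists.

    The defaults mirror the documented one-liner; the function itself is
    a hypothetical helper, not something shipped with the deposit.
    """
    return [
        ["tar", "-xf", archive],                      # unpack the directory dump
        ["createdb", db],                             # create an empty database
        ["psql", "-d", db, "-c",
         "CREATE EXTENSION IF NOT EXISTS vector;"],   # enable pgvector first
        ["pg_restore", "-d", db, "--no-owner", "--no-privileges",
         "-j", str(jobs), dump_dir],                  # parallel restore
    ]


def version_tuple(text):
    """Turn a dotted version string such as '0.8.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def pgvector_ok(installed, minimum="0.8.2"):
    """Check an installed pgvector version against the stated minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def run_restore(dry_run=True, **kwargs):
    """Print each step; execute it only when dry_run is False."""
    for argv in restore_commands(**kwargs):
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_restore(dry_run=True)
```

With dry_run=True the helper only prints the commands, which is a cheap way to review them before touching a server. The installed pgvector version can be read with SELECT extversion FROM pg_extension WHERE extname = 'vector'; and passed to pgvector_ok.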
