Powered by OpenAIRE graph
ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO, Datacite

NewsWords Data (Word Counts)

Authors: Beelen, Kaspar;

Abstract

NewsWords Word Counts from the British Library's Digitised Newspaper Collections

Description

The NewsWords dataset contains word count data derived from newspapers published in Britain during the "long" nineteenth century (1780-1920) and digitised as of 2025. These frequencies are computed from the British Library's collection.

The tar file contains 269,179 JSON files. Each file captures the word counts for one newspaper title in one month. The filenames are structured as "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json". Each file consists of a dictionary mapping words to their frequencies, e.g. {"newspaper": 19, "transmission": 11}. Together, the word counts represent a corpus of 120 billion tokens based on a vocabulary of 200k unique words appearing more than five times. Please follow this link to view a bar chart that breaks down the word counts by decade.

The newspaper_id corresponds to the NLP ids, which are documented in the British Library newspaper catalogue:

> Ryan, Yann, and Luke McKernan. 2021. “Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications”. *Journal of Open Humanities Data* 7 (0): 1. https://doi.org/10.5334/johd.23.

Complete metadata for this newspaper collection is available in another open dataset:

> Westerling, Kalle, Timothy Hobson, Kaspar Beelen, Nilo Pedrazzini, Daniel Wilson, and Katherine McDonough. “Lwmdb Data”. *Zenodo*, December 11, 2024. https://doi.org/10.5281/zenodo.14389180.

Code

The NewsWords Code GitHub repository provides code for converting the "raw" word counts to a more manageable sparse matrix format and for contextualising these counts with additional newspaper metadata, e.g. information about price and politics. Further information about how to use the code and query the NewsWords data is available in the GitHub README.
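As a rough illustration of the file layout described above, the following Python sketch parses a filename, reads the per-month dictionaries from the tar archive, and collects them as sparse coordinate triples using only the standard library. It is not the NewsWords Code repository's implementation. The function names (parse_filename, iter_counts, to_coo) are hypothetical, and the sketch assumes the archive members are plain JSON files whose base names follow the pattern given in the description.

```python
import json
import re
import tarfile
from collections import Counter

# Pattern for "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
# Assumption: the newspaper_id itself contains no underscore.
FILENAME_RE = re.compile(
    r"^(?P<newspaper_id>[^_]+)_(?P<year>\d{4})_(?P<month>\d{2})\.json$"
)


def parse_filename(name):
    """Split a NewsWords filename into (newspaper_id, year, month)."""
    base = name.rsplit("/", 1)[-1]  # drop any directory prefix inside the archive
    match = FILENAME_RE.match(base)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return match["newspaper_id"], int(match["year"]), int(match["month"])


def iter_counts(tar_path):
    """Yield ((newspaper_id, year, month), Counter) for each JSON file in the tar."""
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            key = parse_filename(member.name)
            with archive.extractfile(member) as handle:
                yield key, Counter(json.load(handle))


def to_coo(records):
    """Collect (key, counts) pairs as sparse coordinate triples.

    Returns (row_keys, vocabulary, triples), where row_keys[i] identifies row i,
    vocabulary maps each word to a column index, and triples holds
    (row, column, count) entries for the non-zero cells only.
    """
    row_keys, vocabulary, triples = [], {}, []
    for row, (key, counts) in enumerate(records):
        row_keys.append(key)
        for word, count in counts.items():
            column = vocabulary.setdefault(word, len(vocabulary))
            triples.append((row, column, count))
    return row_keys, vocabulary, triples
```

For example, parse_filename("0003281_1896_07.json") returns ("0003281", 1896, 7). The triples can be passed to a sparse matrix library of choice; storing only non-zero cells matters here because most words do not occur in most newspaper-months.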
To recreate the repository's sparse matrices, please follow the instructions in "Create_sparse_matrices.ipynb".

Limitations

These word counts are derived from the digitised press, which contains billions of words and spans multiple decades. However large, these data constitute around ##% of the number of newspaper titles that circulated in Great Britain. In our paper "Whose News? Critical methods for assessing bias in large historical datasets" (under review) we tackle the issue of representativeness and point out that these data exhibit some partisan bias, in the sense that they overrepresent conservative and liberal newspaper titles, and that this bias varies over the 19th century.

> "Whose News? Critical methods for assessing bias in large historical datasets" (under review)

For more information about the method and data see also:

> Beelen, Kaspar, Jon Lawrence, Daniel C Wilson, and David Beavan, 2023. 'Bias and representativeness in digitized newspaper collections: Introducing the environmental scan.' *Digital Scholarship in the Humanities*, 38(1), pp.1-22.

Keywords

Newspapers as Topic, Newspapers as Topic/statistics & numerical data, Newspapers as Topic/history, Newspapers as Topic/classification
