NewsWords Data (Word Counts)

NewsWords Word Counts from the British Library's Digitised Newspaper Collections

Description

The NewsWords dataset contains word count data derived from newspapers published in Britain during the "long" nineteenth century (1780-1920) and digitised as of 2025. These frequencies are computed from the British Library's collection.

The tar file contains 269,179 JSON files. Each file captures the word counts for one month for one newspaper title. The filenames are structured as follows: "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json". Each file consists of a dictionary mapping words to their frequencies, e.g. {"newspaper": 19, "transmission": 11}.

Together, the word counts represent a corpus of 120 billion tokens based on a vocabulary of 200k unique words appearing more than five times. Please follow this link to view a bar chart that breaks down the word counts by decade.

The newspaper_id corresponds with NLP IDs, which are documented in the British Library newspaper catalogue:

> Ryan, Yann, and Luke McKernan. 2021. “Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications”. *Journal of Open Humanities Data* 7 (0): 1. https://doi.org/10.5334/johd.23.

Complete metadata for this newspaper collection is available in another open dataset:

> Westerling, Kalle, Timothy Hobson, Kaspar Beelen, Nilo Pedrazzini, Daniel Wilson, and Katherine McDonough. “Lwmdb Data”. *Zenodo*, December 11, 2024. https://doi.org/10.5281/zenodo.14389180.

Code

The NewsWords Code GitHub repository provides code for converting "raw" word counts to a more manageable sparse matrix format and contextualises these counts with additional newspaper metadata, e.g. information about price and politics. Further information about how to use the code and query the NewsWords data is available in the GitHub README.
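As a minimal sketch of working with the file layout described above, the following Python snippet splits a filename into its parts and sums token counts per decade. It is not taken from the NewsWords Code repository: the helper names (`parse_filename`, `decade_of`, `total_by_decade`) are hypothetical, and the sample input simply reuses the example filename and dictionary quoted in the description.

```python
import json
import re
from collections import Counter

# Pattern for "{newspaper_id}_{year}_{month}.json", e.g. "0003281_1896_07.json".
FILENAME_RE = re.compile(
    r"^(?P<newspaper_id>[^_]+)_(?P<year>\d{4})_(?P<month>\d{2})\.json$"
)


def parse_filename(name):
    """Split a NewsWords filename into (newspaper_id, year, month)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    # Keep newspaper_id as a string so that leading zeros are preserved.
    return match["newspaper_id"], int(match["year"]), int(match["month"])


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1896 -> 1890."""
    return year - year % 10


def total_by_decade(files):
    """Sum token counts per decade from (filename, json_text) pairs."""
    totals = Counter()
    for name, text in files:
        _, year, _ = parse_filename(name)
        counts = json.loads(text)  # dictionary mapping word -> frequency
        totals[decade_of(year)] += sum(counts.values())
    return totals


# Illustrative input built from the example in the description.
sample = [("0003281_1896_07.json", '{"newspaper": 19, "transmission": 11}')]
print(parse_filename("0003281_1896_07.json"))  # ('0003281', 1896, 7)
print(total_by_decade(sample))                 # Counter({1890: 30})
```

In practice, the `(filename, json_text)` pairs could be produced by iterating over the archive with Python's standard `tarfile` module; the sketch takes them as plain input so that it does not assume anything about the archive's name or internal directory layout.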
To recreate these sparse matrices, please follow the instructions in "Create_sparse_matrices.ipynb".

Limitations

These word counts are derived from the digitised press, containing billions of words and spanning multiple decades. However large, these data cover only around ##% of the newspaper titles that circulated in Great Britain. In our paper "Whose News? Critical methods for assessing bias in large historical datasets" (under review), we tackle the issue of representativeness and point out that these data exhibit some partisan bias, in the sense that they overrepresent conservative and liberal newspaper titles, and that this bias varies over the 19th century.

> "Whose News? Critical methods for assessing bias in large historical datasets" (under review)

For more information about the method and data, see also:

> Beelen, Kaspar, Jon Lawrence, Daniel C Wilson, and David Beavan, 2023. 'Bias and representativeness in digitized newspaper collections: Introducing the environmental scan.' *Digital Scholarship in the Humanities*, 38(1), pp.1-22.

Related Organizations

The Alan Turing Institute
United Kingdom
School of Advanced Study
United Kingdom

Keywords

Newspapers as Topic, Newspapers as Topic/statistics & numerical data, Newspapers as Topic/history, Newspapers as Topic/classification

Fields of Science

social sciences

media and communications
