
Data for manuscript: "Themes in Academic Literature: Prejudice and Social Justice"

Creator: Rozado, David
Keywords: 10. No inequality, 16. Peace & justice, Social justice, Academic literature, Prejudice, News media

Research data > Dataset | 10 Jan 2022 | English | Publisher: Zenodo


doi: 10.5281/zenodo.5832065, 10.5281/zenodo.5832064


Abstract

This data set contains frequency counts of target words in 175 million academic abstracts published in all fields of knowledge. We quantify the prevalence of words denoting prejudice against ethnicity, gender, sexual orientation, gender identity, minority religious sentiment, age, body weight and disability in Semantic Scholar Open Research Corpus (SSORC) abstracts over the period 1970-2020. We then examine the relationship between the prevalence of such terms in the academic literature and their concomitant prevalence in news media content. We also analyze the temporal dynamics of an additional set of terms associated with social justice discourse in both the scholarly literature and news media content. A few additional words not denoting prejudice are also included, since they are used in the manuscript for illustration purposes.

The academic abstracts analyzed in this work were taken from the SSORC. The corpus contains, as of 2020, over 175 million academic abstracts, and associated metadata, published in all fields of knowledge. Semantic Scholar provides the raw data in JSON format. The textual content included in our analysis is limited to the scholarly articles' titles and abstracts and does not include other article elements such as the main body of text or the references section. Thus, we use frequency counts derived from titles and abstracts as a proxy for word prevalence in those articles. This proxy was used because the SSORC does not provide the full text of the indexed articles. Targeted textual content was located in the JSON data and sorted by year to facilitate chronological analysis. Tokens were lowercased prior to estimating frequency counts.
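The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the field names `title`, `paperAbstract` and `year`, the tokenizer pattern, and the function names are all assumptions made for the example, and the real corpus schema and tokenization may differ.

```python
import json
import re
from collections import Counter

# Assumed field names for illustration; the real corpus schema may differ.
TEXT_KEYS = ("title", "paperAbstract")
YEAR_KEY = "year"

# Assumed tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_occurrences(tokens, gram):
    """Count how often the token tuple `gram` occurs in `tokens`."""
    n = len(gram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == gram)


def count_by_year(json_lines, target_terms):
    """Return (target_counts, total_counts), both keyed by year.

    target_counts[year][term] counts occurrences of a word or n-gram in
    titles and abstracts; total_counts[year] is the number of tokens.
    """
    grams = {term: tuple(tokenize(term)) for term in target_terms}
    grams = {term: gram for term, gram in grams.items() if gram}
    target_counts = {}
    total_counts = Counter()
    for line in json_lines:
        record = json.loads(line)
        year = record.get(YEAR_KEY)
        if year is None:
            continue  # records without a year cannot be placed on the timeline
        per_year = target_counts.setdefault(year, Counter())
        for key in TEXT_KEYS:
            text = record.get(key)
            if not text:
                continue  # e.g. a missing abstract: only the title is counted
            tokens = tokenize(text)
            total_counts[year] += len(tokens)
            for term, gram in grams.items():
                per_year[term] += count_occurrences(tokens, gram)
    return target_counts, total_counts
```

Iterating over `sorted(total_counts)` then yields the years in chronological order. Titles and abstracts are tokenized separately here so that an n-gram is never matched across the boundary between the two fields.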
Yearly relative frequencies of a target word or n-gram in the SSORC were estimated by dividing the number of occurrences of the target word/n-gram in all scholarly articles within a given year by the total number of words in all articles of that year. This method of estimating word frequencies accounts for the variable volume of total scientific output over time. This approach has previously been shown to accurately capture the temporal dynamics of historical events and social trends in news media corpora.

It is possible that a small percentage of scholarly articles in the SSORC contain incorrect or missing data. For earlier years in the corpus, abstract information is sometimes missing and only the article's title is available. As a result, the total and target word counts for a small subset of academic abstracts might not be precise. In an analysis of 175 million scientific abstracts, manually checking the accuracy of frequency counts for every abstract is infeasible, and 100% accuracy in capturing abstracts' content may be unattainable because of a small number of erroneous outlier cases in the raw data. Overall, however, we are confident that our frequency metrics are representative of word prevalence in academic content, as illustrated by Figure 2 in the main manuscript, which shows the chronological prevalence in the SSORC of several terms associated with different disciplines of scientific/academic knowledge.

Factor analysis of the frequency count time series was carried out only after Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test confirmed the suitability of the data for factor analysis. A single factor derived from the frequency count time series of prejudice-denoting terms was extracted from each corpus (academic abstracts and news media content). The same procedure was applied to the terms denoting social justice discourse.
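As a rough illustration of the suitability check mentioned above, the sketch below computes the standard Bartlett sphericity statistic, chi-square = -((n - 1) - (2p + 5) / 6) * ln|R|, with p(p - 1)/2 degrees of freedom, where R is the correlation matrix of p time series observed over n years. It is not taken from the author's analysis scripts: the function names are hypothetical, only the standard library is used, and neither the p-value nor the KMO measure is computed (a statistics package would normally supply those).

```python
import math


def correlation_matrix(series):
    """Pearson correlation matrix for a list of equal-length time series."""
    n = len(series[0])
    means = [sum(s) / n for s in series]
    centered = [[x - m for x in s] for s, m in zip(series, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def bartlett_sphericity(series):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test.

    Assumes no series is constant and no two series are perfectly
    collinear; otherwise the correlation matrix is undefined or singular.
    """
    p = len(series)      # number of variables (terms)
    n = len(series[0])   # number of observations (years)
    det = determinant(correlation_matrix(series))
    chi_square = -((n - 1) - (2 * p + 5) / 6) * math.log(det)
    return chi_square, p * (p - 1) // 2
```

A large statistic relative to the chi-square distribution with the returned degrees of freedom indicates that the series are correlated enough for factor analysis to be meaningful.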
A factor loading cutoff of 0.5 was used to ascribe terms to a factor. Cronbach's alpha values, used to check whether the resulting factors were coherent, were extremely high (>0.95).

The textual content of news and opinion articles from the outlets listed in Figure 5 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used word frequency counts derived from these original sources. The textual content included in our analysis is limited to article headlines and main body text and does not include other article elements such as figure captions. Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there are at least 1 million words of article content from that outlet. The yearly frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all of that outlet's articles in a given year by the total number of words in all of those articles. This method of estimating frequency accounts for the variable volume of total article output over time.
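The per-outlet frequency estimate and the 1-million-word inclusion rule described above could be expressed along these lines. This is a sketch under stated assumptions: the input layout (a mapping from `(outlet, year)` to a pair of counts) and the function name are hypothetical and are not taken from the files in this data set.

```python
MIN_WORDS_PER_OUTLET_YEAR = 1_000_000  # inclusion threshold stated in the text


def outlet_year_frequencies(counts, min_words=MIN_WORDS_PER_OUTLET_YEAR):
    """Compute the relative frequency of a target word per (outlet, year).

    `counts` maps (outlet, year) -> (target_word_count, total_word_count).
    Outlet-years with fewer than `min_words` total words are dropped so
    that sparse years do not distort aggregate figures.
    """
    return {
        key: target / total
        for key, (target, total) in counts.items()
        if total >= min_words
    }


# Usage with made-up numbers: the second outlet-year is too sparse to keep.
example = {
    ("outlet_a", 2015): (250, 2_000_000),
    ("outlet_b", 2015): (10, 50_000),
}
frequencies = outlet_year_frequencies(example)
```

Dividing by the yearly total is what makes outlets and years with very different publication volumes comparable.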
The compressed files in this data set are listed next:

- analysisScripts.rar contains the analysis scripts used in the main manuscript and raw data metrics
- scholarlyArticlesContainingTargetWords.rar contains the IDs of each analyzed abstract in the SSORC and the counts of target words and total words for each scholarly article
- targetWordsInMediaArticlesCounts.rar contains counts of target words in news outlets' articles as well as total counts of words in articles

In a small percentage of news articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. As a result, the total and target word counts for a small subset of articles might not be precise. In an analysis of millions of news articles, we cannot manually check the correctness of frequency counts for every single article, and 100% accuracy in capturing articles' content is elusive due to a small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Rozado, Al-Gharbi, and Halberstadt, "Prevalence of Prejudice-Denoting Words in News Media Discourse" for supporting evidence).

31/08/2022 Update: There is a new way to download the Semantic Scholar Open Research Corpus (see https://github.com/allenai/s2orc). This updated version states that the corpus contains 136M+ paper nodes. However, when I downloaded a previous version of the corpus in 2021 (from http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/), I counted 175M unique identifiers.
The URL of the previous version of the corpus is no longer active, but it has been cached by the Internet Archive (https://web.archive.org/web/20201030131959/http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/). I haven't had the time to investigate the specific reason for the mismatch, but perhaps the newer version of the corpus has removed many of the noisy entries in the previous version, which often had missing abstracts. Filtering out entries in low-prevalence languages other than English might be another reason. In any case, Figure 2 of the main manuscript of this work (at https://www.nas.org/academic-questions/35/2/themes-in-academic-literature-prejudice-and-social-justice) should provide support for the validity of the frequency counts.

Related Organizations

Otago Polytechnic
New Zealand


Related research products (3)

- Data for manuscript: "Prevalence of prejudice denoting words in news media discourse: a chronological analysis" (2021)
- Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left Political Extremism in U.S. and U.K. News Media" (2021)
- Data for manuscript: "The Prevalence of Prejudice Denoting Terms in Spanish Newspapers" (2021)

Impact by BIP!

- Citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by UsageCounts

Views: 3

Related to Research communities

Knowmad Institut
