Powered by OpenAIRE graph
Found an issue? Give us feedback
ZENODOarrow_drop_down
ZENODO
Dataset . 2020
License: CC BY NC
Data sources: Datacite
ZENODO
Dataset . 2020
License: CC BY NC
Data sources: Datacite
versions View all 2 versions
addClaim

Science Education Research Topic Modeling Dataset

Authors: Odden, Tor Ole B.; Marin, Alessandro; Rudolph, John L.;

Science Education Research Topic Modeling Dataset

Abstract

This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal. The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way: We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text. We removed the reference sections of the articles, as well headings (in all caps) such as “ABSTRACT”. We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”) We removed all numbers, symbols, special characters, and punctuation, and lowercased all words. We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words. We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs). We detected and create bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”. After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl. In addition to this file, we have also included the following files: SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook. This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to "communicate TDM Output to third parties as part of original non-commercial research carried out by User, including in articles that describe, analyse and interpret research. Publications or analyses resulting from TDM of Wiley Content may include brief quotations from the original text as permitted under Section 107 or 108 of the 1976 United States Copyright Act in the United States, or as permitted by other applicable national copyright laws internationally. Any such extracts, as well as bibliographic metadata, must cite the original Wiley Content in the form of a DOI link. Permission to reproduce images shall be required in accordance with clause 5." The full text of the agreement is available here: https://olabout.wiley.com/WileyCDA/Section/id-826542.html

Related Organizations
Keywords

Historical Analysis, Science Education, Topic Modeling, Latent Dirichlet Allocation, Natural Language Processing

  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average
    OpenAIRE UsageCounts
    Usage byUsageCounts
    visibility views 29
    download downloads 12
  • 29
    views
    12
    downloads
    Powered byOpenAIRE UsageCounts
Powered by OpenAIRE graph
Found an issue? Give us feedback
visibility
download
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
views
OpenAIRE UsageCountsViews provided by UsageCounts
downloads
OpenAIRE UsageCountsDownloads provided by UsageCounts
0
Average
Average
Average
29
12