ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite

TPT–PE Thematic Analysis Dataset

Authors: Caramaschi, Martina; Odden, Tor Ole Bigton

Abstract

Introduction

This dataset accompanies the article “Analyzing the history of physics education in the USA and Europe through natural language processing” by Martina Caramaschi and Tor Ole B. Odden (https://doi.org/10.1103/wfvw-hkyy). It contains the cleaned and processed text of articles from two major physics education journals:

  • The Physics Teacher (TPT): 7,203 articles (1963–2020)
  • Physics Education (PE): 6,445 articles (1966–2024)

The datasets

To prepare the data for analysis, we first reduced the number of scraped articles by applying several filtering and cleaning steps:

  • Removed very short articles (ads, announcements, etc.).
  • Removed documents without listed authors.
  • Corrected malformed titles and metadata issues.
  • Excluded duplicates, book reviews, and journal business (~7,399 removals).
  • Filtered out articles with specific headers (e.g., ANNOUNCEMENTS, BOOK REVIEWS, LETTERS TO THE EDITOR).
  • Excluded errata, corrections, replies, and other non-research content.
  • Discarded articles under 500 words.

With the resulting dataset, we improved the texts by removing unneeded material appearing before article titles and by cleaning incorrect or irrelevant sections out of the articles’ content. After this preprocessing, we tokenized the cleaned texts and created bigrams to prepare the corpus for topic modeling with latent Dirichlet allocation (LDA). After filtering, each document was transformed into a list of individual words (tokens). These tokenized representations were then collected and stored in Python pickle format.

Specific datasets included

The following files are included in this dataset:

  • 07_bigrams_combined_V2.pkl: the combined dataset of all articles from Physics Education and The Physics Teacher. It is a dataframe obtained by merging the Physics Education dataframe and The Physics Teacher dataframe, so that it contains the articles from both journals, and it is a shuffled and re-indexed version of that combined dataset. The file is stored as a pickled pandas dataframe containing a list of lists: each row corresponds to one article, and each article is represented as a list of sentences, where each sentence is itself a list of tokens (words). During the cleaning process we removed stopwords (e.g., if, and, but), punctuation, numbers, and symbols, then lowercased all words and merged frequent collocations into bigrams (e.g., high school → high_school). An example representing three sentences from different articles contained in the dataframe is:

    [['calibrate', 'laser', 'power', 'meter', 'holographic', 'work'],
     ['robot', 'scientist', 'develop', 'student', 'epistemic_insight', 'lesson'],
     ['new', 'free_fall', 'experiment', 'determine', 'acceleration_gravity', 'kit']]

  • matrix_paper_weights_comb_k20_928.pkl: a metadata dataframe containing publication year, title, authors, DOI, and journal for each paper in the combined dataset, in the same order as the processed data (07_bigrams_combined_V2.pkl). In addition, this file includes the LDA topic weights used in the analysis (one column per topic, with per-paper weights summing to 1).

These files provide the processed data used for the LDA topic modeling and thematic analysis described in the associated article. The notebook that replicates our LDA analysis, with a written explanation of all the steps and suggestions on how to explore the results, is available in the corresponding public GitHub repository: https://github.com/martinacaramaschi/TPT-PE-thematic-analysis
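The token cleaning and bigram merging described above can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset: the stopword list, the regular expression, the function names, and the `min_count` threshold are all assumptions chosen for illustration, and the bigram step is a simple frequency count rather than whatever collocation method the original analysis used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the original list is not given).
STOPWORDS = {"if", "and", "but", "the", "a", "of", "in", "to", "is", "for"}


def tokenize(sentence):
    """Lowercase the text, keep alphabetic words only, and drop stopwords."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def merge_bigrams(sentences, min_count=2):
    """Join adjacent word pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter()
    for tokens in sentences:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged = []
    for tokens in sentences:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip the second word of the merged pair
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged


# Two made-up sentences, not taken from the dataset.
raw = [
    "The high school students measured free fall in 2 trials.",
    "A free fall experiment for high school physics!",
]
corpus = merge_bigrams([tokenize(s) for s in raw])
print(corpus)
```

On these two toy sentences, "high school" and "free fall" each occur twice, so they are merged into `high_school` and `free_fall`, mirroring the bigram style shown in the example above.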
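As a rough sketch of how the topic-weight table might be explored, the example below builds a small stand-in dataframe and runs three checks on it. The column names (`year`, `journal`, `topic_0`, ...) and the numbers are hypothetical; the actual column names in matrix_paper_weights_comb_k20_928.pkl should be inspected after loading it with `pandas.read_pickle`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata/weights table. All column names and values
# here are hypothetical; with the real file one would start from
#   papers = pd.read_pickle("matrix_paper_weights_comb_k20_928.pkl")
# and look at papers.columns first.
papers = pd.DataFrame(
    {
        "year": [1970, 1970, 2020],
        "journal": ["TPT", "PE", "PE"],
        "topic_0": [0.7, 0.2, 0.1],
        "topic_1": [0.2, 0.5, 0.3],
        "topic_2": [0.1, 0.3, 0.6],
    }
)

# Select the topic-weight columns (assumed here to share a common prefix).
topic_cols = [c for c in papers.columns if c.startswith("topic_")]

# Per-paper weights should sum to 1, up to floating-point error.
row_sums = papers[topic_cols].sum(axis=1)
weights_ok = bool(np.allclose(row_sums, 1.0))

# Topic with the largest weight in each paper.
papers["dominant_topic"] = papers[topic_cols].idxmax(axis=1)

# Average topic weight per publication year.
yearly_mean = papers.groupby("year")[topic_cols].mean()

print(weights_ok)
print(papers[["year", "journal", "dominant_topic"]])
print(yearly_mean)
```

Because the metadata rows are stated to be in the same order as 07_bigrams_combined_V2.pkl, the two tables can in principle be matched by row position. Note that unpickling can execute arbitrary code, so pickle files should only be loaded from a trusted source.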
