
Introduction

This dataset accompanies the article “Analyzing the history of physics education in the USA and Europe through natural language processing” by Martina Caramaschi and Tor Ole B. Odden (https://doi.org/10.1103/wfvw-hkyy). It contains the cleaned and processed text of articles from two major physics education journals:

- The Physics Teacher (TPT): 7,203 articles (1963–2020)
- Physics Education (PE): 6,445 articles (1966–2024)

The datasets

To prepare the data for analysis, we first reduced the number of scraped articles by applying several filtering and cleaning steps:

- Removed very short articles (ads, announcements, etc.).
- Removed documents without listed authors.
- Corrected malformed titles and metadata issues.
- Excluded duplicates, book reviews, and journal business (~7,399 removals).
- Filtered out articles with specific headers (e.g., ANNOUNCEMENTS, BOOK REVIEWS, LETTERS TO THE EDITOR).
- Excluded errata, corrections, replies, and other non-research content.
- Discarded articles under 500 words.

With the resulting dataset, we improved the correctness of the texts by removing unneeded material appearing before article titles and by cleaning the articles’ content of incorrect or irrelevant sections. After this preprocessing, we tokenized the cleaned texts and created bigrams to prepare the corpus for topic modeling with latent Dirichlet allocation (LDA). Each document was thereby transformed into a list of individual words (tokens), and these tokenized representations were collected and stored in Python pickle format.

Specific datasets included

The following files are included in this dataset:

- 07_bigrams_combined_V2.pkl: the combined dataset of all articles from Physics Education and The Physics Teacher. It is a dataframe obtained by merging the Physics Education dataframe and The Physics Teacher dataframe, so that it contains the articles from both journals, and then shuffling and re-indexing the result.
The file is stored as a pickled pandas dataframe of nested lists: each row corresponds to one article, and each article is represented as a list of sentences, where each sentence is itself a list of tokens (words). During the cleaning process we removed stopwords (e.g., if, and, but), punctuation, numbers, and symbols, lowercased all words, and merged frequent collocations into bigrams (e.g., high school → high_school). An example showing three sentences from different articles in the dataframe is:

[['calibrate', 'laser', 'power', 'meter', 'holographic', 'work'],
 ['robot', 'scientist', 'develop', 'student', 'epistemic_insight', 'lesson'],
 ['new', 'free_fall', 'experiment', 'determine', 'acceleration_gravity', 'kit']]

- matrix_paper_weights_comb_k20_928.pkl: a metadata dataframe containing the publication year, title, authors, DOI, and journal for each paper in the combined dataset, in the same order as the processed data (07_bigrams_combined_V2.pkl). This file also includes the LDA topic weights used in the analysis (one column per topic, with per-paper weights summing to 1).

These files provide the processed data used for the LDA topic modeling and thematic analysis described in the associated article. The notebook that replicates our LDA analysis, with a written explanation of every step and suggestions on how to explore the results, is available in the corresponding public GitHub repository: https://github.com/martinacaramaschi/TPT-PE-thematic-analysis
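
The two files described above could be loaded and inspected roughly as follows. This is a minimal sketch, not the authors' notebook: it assumes the pickles sit in the working directory, and the column names used here ("tokens", "topic_0", "topic_1") are hypothetical placeholders, since the actual column names are not listed in this description. The helper functions are illustrative, and small toy frames stand in for the real data so the example runs on its own. Pickle files can execute code when loaded, so only unpickle files obtained from a trusted source.

```python
import numpy as np
import pandas as pd


def load_frame(path):
    """Read a pickled pandas object. Only unpickle files from a trusted source."""
    return pd.read_pickle(path)


def flatten_article(sentences):
    """Join an article's sentences (each a list of tokens) into one token list."""
    return [token for sentence in sentences for token in sentence]


def check_topic_weights(meta, topic_cols, tol=1e-6):
    """Return True if every paper's topic weights sum to 1 within `tol`."""
    row_sums = meta[topic_cols].sum(axis=1)
    return bool(np.allclose(row_sums, 1.0, atol=tol))


def dominant_topic(meta, topic_cols):
    """Return, for each paper, the name of the column with the largest weight."""
    return meta[topic_cols].idxmax(axis=1)


# Toy stand-ins so the sketch runs without the real files.
# All column names and values below are hypothetical.
toy_corpus = pd.DataFrame({
    "tokens": [
        [["calibrate", "laser", "power", "meter"], ["holographic", "work"]],
        [["new", "free_fall", "experiment"], ["determine", "acceleration_gravity"]],
    ]
})
toy_meta = pd.DataFrame({
    "year": [1975, 2010],
    "journal": ["PE", "TPT"],
    "topic_0": [0.7, 0.2],
    "topic_1": [0.3, 0.8],
})
topic_cols = ["topic_0", "topic_1"]

flat_docs = toy_corpus["tokens"].apply(flatten_article)  # one token list per article
weights_ok = check_topic_weights(toy_meta, topic_cols)   # weights should sum to 1
top_topics = dominant_topic(toy_meta, topic_cols)        # strongest topic per paper

# With the real files (assumed to be in the working directory):
# corpus = load_frame("07_bigrams_combined_V2.pkl")
# meta = load_frame("matrix_paper_weights_comb_k20_928.pkl")
# assert len(corpus) == len(meta)  # rows are documented to be in the same order
```

Because the metadata file is documented to follow the same row order as the processed texts, comparing the two lengths is a cheap first check before pairing rows by position. Inspecting the columns of each loaded frame will show which real names to substitute for the placeholders.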
