
main_dataset.csv

This dataset consists of paired articles on identical topics collected from Grokipedia (G) and Wikipedia (W). Each row corresponds to a single topic and contains metadata, structural features, linguistic statistics, similarity measures, and bias/factuality scores for both sources.

Identification and Existence Flags

title: Canonical topic title.
slug: URL-friendly identifier for the topic.
exists_grokipedia: Binary indicator of whether the topic exists in Grokipedia.
exists_wikipedia: Binary indicator of whether the topic exists in Wikipedia.

Structural and Content Features

All variables prefixed with a_ refer to Grokipedia, and those prefixed with b_ refer to Wikipedia.

Document Structure
paragraph_count: Number of paragraphs.
heading_count_h1–h4: Number of headings at each HTML level.
section_count_h2_h4: Number of sections defined by H2–H4 headings.
link_count: Number of internal and external hyperlinks.
image_count: Number of embedded images.
reference_count: Number of references/citations.

Normalized Density Measures
refs_per_1k_words: References per 1,000 words.
links_per_1k_words: Links per 1,000 words.
headings_per_1k_words: Headings per 1,000 words.

Word Counts
clean_word_count: Number of cleaned (tokenized, stopword-filtered) words.
clean_words_alpha: Alphabetic cleaned words only.
raw_visible_words_alpha: Alphabetic visible words before cleaning.

Lexical and Semantic Similarity

Lexical Similarity
lexical_tfidf_cosine: Cosine similarity between TF-IDF vectors.
lexical_jaccard_unigram: Jaccard similarity over unigram sets.
ngram_overlap_1/2/3: Overlap of unigrams, bigrams, and trigrams.

Semantic Similarity
semantic_embed_cosine: Cosine similarity between sentence embeddings.
bertscore_f1: BERTScore F1 semantic similarity.
stylistic_similarity: Composite stylistic similarity metric.

Linguistic and Readability Features

Computed separately for Grokipedia and Wikipedia.

Syntactic and Lexical Properties
avg_sentence_len: Mean sentence length (words).
lexical_diversity: Type-token ratio.
lexical_density: Proportion of content words.

Readability
flesch_kincaid: Flesch–Kincaid grade level.
gunning_fog: Gunning Fog index.
reading_time_min: Estimated reading time in minutes.

POS Distributions
pos_noun, pos_verb, pos_adj, pos_adv: Proportions of part-of-speech categories.

Raw Text Statistics
char_count: Character count.
word_count: Word count.
sentence_count: Sentence count.

Topic and Clustering Metadata

topic_gpt: Topic label generated by GPT-based topic modeling.
clst_k_means: Cluster ID from k-means clustering.
topic_k_means: Human-interpretable topic label from k-means.

Bias, Leaning, and Factuality Measures

Political Leaning
The party leaning metric from this dataset was used to extract the following metrics.
leaning_Grokipedia: Estimated political leaning score for Grokipedia.
leaning_Wikipedia: Estimated political leaning score for Wikipedia.
leaning_diff_G_minus_W: Difference in leaning (Grokipedia − Wikipedia).

Bias Scores
The bias score from this dataset was used to extract the following metrics.
bias_Grokipedia: Overall bias score for Grokipedia.
bias_Wikipedia: Overall bias score for Wikipedia.
bias_diff_G_minus_W: Difference in bias (Grokipedia − Wikipedia).

Factuality
The factuality score from this dataset was used to extract the following metrics.
factual_Grokipedia: Factuality score for Grokipedia.
factual_Wikipedia: Factuality score for Wikipedia.
factual_diff_G_minus_W: Difference in factuality (Grokipedia − Wikipedia).

Combined Similarity Score
combined_score: Composite score aggregating multiple similarity metrics.

ref_domains_per_article_all.csv

Each row represents a single referenced domain within a specific article.

Identification and Metadata

title: Title of the article in which the reference appears.
platform: Platform hosting the article (Grokipedia, Wikipedia).
rank: Rank of the domain within the article, ordered by frequency of appearance.
domain: Referenced domain name (e.g., nytimes.com, foxnews.com).

Reference Frequency Measures

count_in_article: Number of times the domain is cited within the article.
n_refs_found_article: Total number of references found in the article.

Source Quality and Ideology

These variables are extracted from this dataset.
bias: Political bias score of the domain.
factual_reporting: Factual reliability score of the domain.
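The column conventions described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' pipeline: the prefixed spellings (a_reference_count, a_word_count, and so on) are inferred from the a_/b_ convention, using word_count as the denominator for the per-1,000-word densities is a guess, the helper names are hypothetical, and the sample rows are made up rather than taken from the files.

```python
import csv
import io


def per_1k_words(count: float, word_count: float) -> float:
    """Occurrences per 1,000 words; returns 0.0 for an empty article."""
    return 1000.0 * count / word_count if word_count else 0.0


def g_minus_w(row: dict, g_col: str, w_col: str) -> float:
    """Grokipedia value minus Wikipedia value, as in the *_diff_G_minus_W columns."""
    return float(row[g_col]) - float(row[w_col])


def domain_share(row: dict) -> float:
    """Fraction of an article's references that cite this row's domain."""
    total = float(row["n_refs_found_article"])
    return float(row["count_in_article"]) / total if total else 0.0


# Made-up rows for illustration only; these are NOT values from the dataset.
# The a_/b_ column spellings are assumed from the prefix convention.
SAMPLE_MAIN = (
    "title,a_reference_count,a_word_count,b_reference_count,b_word_count,"
    "leaning_Grokipedia,leaning_Wikipedia\n"
    "Example topic,30,6000,50,4000,0.25,-0.25\n"
)
SAMPLE_DOMAINS = (
    "title,platform,rank,domain,count_in_article,n_refs_found_article\n"
    "Example topic,Wikipedia,1,example.org,10,40\n"
)

main_row = next(csv.DictReader(io.StringIO(SAMPLE_MAIN)))
domain_row = next(csv.DictReader(io.StringIO(SAMPLE_DOMAINS)))

# Recompute a normalized density for each source (a_ = Grokipedia, b_ = Wikipedia).
for prefix in ("a_", "b_"):
    density = per_1k_words(
        float(main_row[prefix + "reference_count"]),
        float(main_row[prefix + "word_count"]),
    )
    print(prefix + "refs_per_1k_words", density)

print("leaning_diff_G_minus_W",
      g_minus_w(main_row, "leaning_Grokipedia", "leaning_Wikipedia"))
print("domain share", domain_share(domain_row))
```

To try this on the real files, the io.StringIO samples could be swapped for open("main_dataset.csv", newline="") and open("ref_domains_per_article_all.csv", newline=""), after checking the actual header names.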
