Selective divergence between Grokipedia and Wikipedia articles

main_dataset.csv

This dataset consists of paired articles on identical topics collected from Grokipedia (G) and Wikipedia (W). Each row corresponds to a single topic and contains metadata, structural features, linguistic statistics, similarity measures, and bias/factuality scores for both sources.

Identification and Existence Flags

title: Canonical topic title.
slug: URL-friendly identifier for the topic.
exists_grokipedia: Binary indicator of whether the topic exists in Grokipedia.
exists_wikipedia: Binary indicator of whether the topic exists in Wikipedia.

Structural and Content Features

Variables prefixed with a_ refer to Grokipedia; variables prefixed with b_ refer to Wikipedia.

Document Structure

paragraph_count: Number of paragraphs.
heading_count_h1–h4: Number of headings at each HTML level.
section_count_h2_h4: Number of sections defined by H2–H4 headings.
link_count: Number of internal and external hyperlinks.
image_count: Number of embedded images.
reference_count: Number of references/citations.

Normalized Density Measures

refs_per_1k_words: References per 1,000 words.
links_per_1k_words: Links per 1,000 words.
headings_per_1k_words: Headings per 1,000 words.

Word Counts

clean_word_count: Number of cleaned (tokenized, stopword-filtered) words.
clean_words_alpha: Alphabetic cleaned words only.
raw_visible_words_alpha: Alphabetic visible words before cleaning.

Lexical and Semantic Similarity

Lexical Similarity

lexical_tfidf_cosine: Cosine similarity between TF-IDF vectors.
lexical_jaccard_unigram: Jaccard similarity over unigram sets.
ngram_overlap_1/2/3: Overlap of unigrams, bigrams, and trigrams.

Semantic Similarity

semantic_embed_cosine: Cosine similarity between sentence embeddings.
bertscore_f1: BERTScore F1 semantic similarity.
stylistic_similarity: Composite stylistic similarity metric.

Linguistic and Readability Features

Computed separately for Grokipedia and Wikipedia.

Syntactic and Lexical Properties

avg_sentence_len: Mean sentence length (words).
lexical_diversity: Type-token ratio.
lexical_density: Proportion of content words.

Readability

flesch_kincaid: Flesch–Kincaid grade level.
gunning_fog: Gunning Fog index.
reading_time_min: Estimated reading time in minutes.

POS Distributions

pos_noun, pos_verb, pos_adj, pos_adv: Proportions of part-of-speech categories.

Raw Text Statistics

char_count: Character count.
word_count: Word count.
sentence_count: Sentence count.

Topic and Clustering Metadata

topic_gpt: Topic label generated by GPT-based topic modeling.
clst_k_means: Cluster ID from k-means clustering.
topic_k_means: Human-interpretable topic label from k-means.

Bias, Leaning, and Factuality Measures

Political Leaning

The party leaning metric from this dataset was used to extract the following metrics.

leaning_Grokipedia: Estimated political leaning score for Grokipedia.
leaning_Wikipedia: Estimated political leaning score for Wikipedia.
leaning_diff_G_minus_W: Difference in leaning (Grokipedia − Wikipedia).

Bias Scores

The bias score from this dataset was used to extract the following metrics.

bias_Grokipedia: Overall bias score for Grokipedia.
bias_Wikipedia: Overall bias score for Wikipedia.
bias_diff_G_minus_W: Difference in bias (Grokipedia − Wikipedia).

Factuality

The factuality score from this dataset was used to extract the following metrics.

factual_Grokipedia: Factuality score for Grokipedia.
factual_Wikipedia: Factuality score for Wikipedia.
factual_diff_G_minus_W: Difference in factuality (Grokipedia − Wikipedia).

Combined Similarity Score

combined_score: Composite score aggregating multiple similarity metrics.

ref_domains_per_article_all.csv

Each row represents a single referenced domain within a specific article.

Identification and Metadata

title: Title of the article in which the reference appears.
platform: Platform hosting the article (Grokipedia, Wikipedia).
rank: Rank of the domain within the article, ordered by frequency of appearance.
domain: Referenced domain name (e.g., nytimes.com, foxnews.com).

Reference Frequency Measures

count_in_article: Number of times the domain is cited within the article.
n_refs_found_article: Total number of references found in the article.

Source Quality and Ideology

These variables are extracted from this dataset.

bias: Political bias score of the domain.
factual_reporting: Factual reliability score of the domain.
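As an illustration of how the a_/b_ prefix convention and the G_minus_W difference columns of main_dataset.csv could be used, here is a minimal Python sketch. It rests on several assumptions not confirmed by the description above: that the prefix is joined directly to the feature name (for example a_reference_count), that the existence flags are stored as the strings "1" and "0", and that missing values appear as empty fields. The helper names paired_difference and check_diff_column are hypothetical, and the two sample rows are invented purely to make the sketch runnable; they are not taken from the dataset.

```python
import csv
import io


def paired_difference(row, feature, g_prefix="a_", w_prefix="b_"):
    """Return Grokipedia minus Wikipedia for one a_/b_ feature pair.

    Returns None when either value is missing (assumed to be an empty field).
    """
    g = row.get(g_prefix + feature) or ""
    w = row.get(w_prefix + feature) or ""
    if g == "" or w == "":
        return None
    return float(g) - float(w)


def check_diff_column(row, g_col, w_col, diff_col, tol=1e-9):
    """Return True when diff_col equals g_col minus w_col within tol."""
    expected = float(row[g_col]) - float(row[w_col])
    return abs(expected - float(row[diff_col])) <= tol


# Invented sample rows for illustration only; NOT real dataset values.
# Column names with a_/b_ prefixes are an assumed naming scheme.
SAMPLE_CSV = """title,exists_grokipedia,exists_wikipedia,a_reference_count,b_reference_count,leaning_Grokipedia,leaning_Wikipedia,leaning_diff_G_minus_W
Example topic,1,1,40,25,0.5,0.25,0.25
Another topic,1,0,12,,0.0,0.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Keep only topics present on both platforms (flags assumed to be "1"/"0").
paired = [
    r for r in rows
    if r["exists_grokipedia"] == "1" and r["exists_wikipedia"] == "1"
]

for r in paired:
    diff = paired_difference(r, "reference_count")
    consistent = check_diff_column(
        r, "leaning_Grokipedia", "leaning_Wikipedia", "leaning_diff_G_minus_W"
    )
    print(r["title"], diff, consistent)
```

To run the same checks on the real file, the in-memory sample can be swapped for open("main_dataset.csv", newline="", encoding="utf-8"), after confirming the actual column names and flag encoding against the file header.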
