Powered by OpenAIRE graph
Dataset
Data sources: ZENODO

SMIDGE Daily Mail comments dataset

Authors: Gulas, Christian

Abstract

The dataset of user comments was sourced from the online platform of the Daily Mail. A custom Python-based web scraping tool was developed to systematically extract data from articles published during the calendar year 2021. This initial process retrieved a corpus of 224,981 articles and over 41 million associated user comments. For each comment, metadata was collected, including the comment text, user ID, timestamp, and community feedback metrics such as positive and negative votes.

The dataset provided for analysis is a random sample of 150,000 user comments drawn from this 2021 collection. To keep the data suitable for in-depth textual analysis, the sample contains only comments of at least 20 words. This filter isolates more substantive comments, which makes the dataset well suited to tasks such as topic modeling, sentiment analysis, and detailed qualitative examination.

Column description:

· RowID: Sequential row identifier within the exported dataset.
· AssetId: Identifier of the Daily Mail article to which the comment belongs.
· category: Content category/section of the article (e.g. news, sport, femail, tvshowbiz).
· custom_id: Unique identifier of the comment.
· AssetHeadline: Headline/title of the article.
· DateCreated: Date and time when the comment was created; stored in the file as a numeric date value.
· AssetCommentCount: Total number of comments associated with the article.
· AssetUrl: URL path of the corresponding Daily Mail article.
· message: Full text of the user comment.
· year: Year of publication/collection of the comment (2021).
· VoteCount: Total number of votes received by the comment.
· VoteRating: Net rating of the comment, calculated as positive votes minus negative votes.
· pos_votes: Number of positive votes received by the comment.
· neg_votes: Number of negative votes received by the comment.
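
The description above implies two properties that can be checked on each row after loading: VoteRating should equal pos_votes minus neg_votes, and message should contain at least 20 words. The following is a minimal sketch of such a check, not the author's own tooling. It assumes rows arrive as dictionaries keyed by the column names listed above (for example from `csv.DictReader`), that vote fields parse as numbers, and that a "word" is a whitespace-separated token; the exact word-counting rule used for the sample is not stated in the description, so borderline comments may be classified differently.

```python
# Column names are taken from the column description; everything else
# (function names, the whitespace-based word count) is an assumption.
EXPECTED_COLUMNS = (
    "RowID", "AssetId", "category", "custom_id", "AssetHeadline",
    "DateCreated", "AssetCommentCount", "AssetUrl", "message", "year",
    "VoteCount", "VoteRating", "pos_votes", "neg_votes",
)


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(str(text).split())


def check_row(row, min_words=20):
    """Check one comment row against the documented column semantics.

    Returns a dict with two booleans:
      rating_ok: VoteRating == pos_votes - neg_votes
      length_ok: message has at least `min_words` words
    Raises KeyError if any documented column is absent from the row.
    """
    missing = [name for name in EXPECTED_COLUMNS if name not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")

    # float() accepts both "5" and "5.0", whichever form the export uses.
    rating = float(row["VoteRating"])
    net_votes = float(row["pos_votes"]) - float(row["neg_votes"])

    return {
        "rating_ok": rating == net_votes,
        "length_ok": word_count(row["message"]) >= min_words,
    }
```

Applied over a whole file, rows where either flag is false would be worth inspecting by hand before analysis, since they may reflect a different word-counting rule rather than a data error.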
