SMIDGE Daily Mail comments dataset

The dataset of user comments was sourced from the online platform of the Daily Mail. A custom Python-based web scraping tool was developed to systematically extract data from articles published during the calendar year 2021. This initial process retrieved a corpus of 224,981 articles and downloaded over 41 million associated user comments. For each comment, metadata was collected, including the comment text, user ID, timestamp, and community feedback metrics such as positive and negative votes.

The dataset provided for analysis is a random sample of 150,000 user comments drawn from this 2021 collection. To keep the data suitable for in-depth textual analysis, a length filter was applied during sampling: the sample contains only comments of at least 20 words. This step isolates more substantive comments, making the dataset well suited to analytical tasks such as topic modeling, sentiment analysis, and detailed qualitative examination.

Column description:

· RowID: Sequential row identifier within the exported dataset.
· AssetId: Identifier of the Daily Mail article to which the comment belongs.
· category: Content category/section of the article (e.g. news, sport, femail, tvshowbiz).
· custom_id: Unique identifier of the comment.
· AssetHeadline: Headline/title of the article.
· DateCreated: Date and time when the comment was created; stored in the file as a numeric date value.
· AssetCommentCount: Total number of comments associated with the article.
· AssetUrl: URL path of the corresponding Daily Mail article.
· message: Full text of the user comment.
· year: Year of publication/collection of the comment (2021).
· VoteCount: Total number of votes received by the comment.
· VoteRating: Net rating of the comment, calculated as positive votes minus negative votes.
· pos_votes: Number of positive votes received by the comment.
· neg_votes: Number of negative votes received by the comment.
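
The column descriptions above imply a few properties that can be checked after loading the file. The following is a minimal sketch, not part of the dataset's own tooling. It assumes each row is available as a dictionary of strings keyed by the documented column names (as `csv.DictReader` would produce from a CSV export), that the numeric columns hold plain integers, and that words are counted by splitting on whitespace. The function name `check_row` and the example row are hypothetical.

```python
from typing import Dict, List

MIN_WORDS = 20  # documented minimum comment length, in words


def check_row(row: Dict[str, str]) -> List[str]:
    """Return the names of documented properties that this row violates."""
    problems = []
    pos = int(row["pos_votes"])
    neg = int(row["neg_votes"])
    # VoteRating is documented as positive votes minus negative votes.
    if int(row["VoteRating"]) != pos - neg:
        problems.append("VoteRating")
    # Assumption: words are counted by splitting on whitespace; the
    # dataset description does not say how words were counted.
    if len(row["message"].split()) < MIN_WORDS:
        problems.append("message_length")
    # All comments are documented as belonging to 2021.
    if int(row["year"]) != 2021:
        problems.append("year")
    return problems


# Hypothetical row, invented for illustration only (not taken from the dataset).
example = {
    "message": " ".join(["word"] * 25),
    "pos_votes": "12",
    "neg_votes": "5",
    "VoteRating": "7",
    "year": "2021",
}
print(check_row(example))
```

An empty list means the row is consistent with the description. DateCreated is left untouched here because the description does not state which numeric date encoding the file uses.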


Funded by

EC| SMIDGE