ZENODO
Dataset . 2023
License: CC BY
Data sources: Datacite
This Research product is the result of merged Research products in OpenAIRE.

Reddit Comments Dataset for Text Style Transfer Tasks

Authors: Kopf, Fabian;

Abstract

A dataset of Reddit comments prepared for text style transfer tasks. The dataset contains Reddit comments together with their translations into formal language. The translations were generated with text-davinci-003. To make text-davinci-003 rewrite each comment in a more formal version, the following prompt was used:

"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"

This prompting technique was taken from A Recipe For Arbitrary Text Style Transfer with Large Language Models (Reif et al., 2022).

The dataset contains comments from the following subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

The quality of the formal translations was assessed with BERTScore and chrF++:

- BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
- chrF++: 37.16

The average perplexity of the generated formal texts, calculated using GPT-2, is 123.77.

The dataset consists of 3 components.

reddit_commments.csv

This file contains a collection of randomly selected comments from 20 subreddits. For each comment, the following information was collected:

- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens the standard GPT-2 tokenizer splits the comment into)
- perplexity (perplexity GPT-2 calculates for the comment)

The comments were filtered. This file contains only comments that:

- were split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs

This file was used as the source for the other two file types.

Labeled Files (training_labeled.csv and eval_labeled.csv)

These files contain the formal translations of the Reddit comments. From each subreddit, the 150 comments with the highest GPT-2 perplexity were translated into a formal version. This filter was used so that as many of the translated comments as possible have high stylistic salience. The files are structured as follows:

- Subreddit (name of the subreddit where the comment was posted)
- Original Comment
- Formal Comment

Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

Each entry in these files contains an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment. These files can be used to train models to perform style transfer based on given examples. The task is to transform the formal translation of the Reddit comment into the style of the three given examples. An entry in this file is structured as follows:

"data": [
  {
    "input_sentence": "The original Reddit comment",
    "style_samples": [
      "sample1",
      "sample2",
      "sample3"
    ],
    "results_sentence": "The formal translated input_sentence",
    "subreddit": "The subreddit from which the comments originated"
  },
  "..."
]
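The filtering, selection, and prompting steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: the function names, the URL regular expression, and the treatment of {original_comment} as a plain placeholder are all assumptions, and the record does not say whether literal braces surrounded the comment in the actual prompt. Only the column names, the JSON keys, the thresholds, and the prompt wording come from the description.

```python
import heapq
import re
from collections import defaultdict

# Assumed URL heuristic; the description only says comments "do not contain URLs".
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)
PLACEHOLDER_BODIES = {"[removed]", "[deleted]"}

# Prompt wording as quoted in the description; the trailing "{" is kept literally.
PROMPT_TEMPLATE = (
    "Here is some text: {original_comment} "
    "Here is a rewrite of the text, which is more neutral: {"
)


def keep_comment(row):
    """Apply the three stated filters to one row of reddit_commments.csv."""
    tokens = int(row["token_size"])
    if not 10 < tokens < 512:  # more than 10 and fewer than 512 tokens
        return False
    if row["body"].strip() in PLACEHOLDER_BODIES:
        return False
    if URL_PATTERN.search(row["body"]):
        return False
    return True


def top_by_perplexity(rows, k=150):
    """Return the k rows with the highest perplexity for each subreddit."""
    by_subreddit = defaultdict(list)
    for row in rows:
        by_subreddit[row["subreddit"]].append(row)
    return {
        name: heapq.nlargest(k, items, key=lambda r: float(r["perplexity"]))
        for name, items in by_subreddit.items()
    }


def build_prompt(comment):
    """Fill the placeholder (assumed to be a plain substitution slot)."""
    return PROMPT_TEMPLATE.replace("{original_comment}", comment)


def to_training_pair(entry):
    """Turn one JSON entry into a style transfer example.

    The formal text is the model input and the original comment is the
    target, matching the task direction stated in the description.
    """
    return {
        "source": entry["results_sentence"],
        "style_samples": list(entry["style_samples"]),
        "target": entry["input_sentence"],
    }
```

Rows in this shape can be obtained by reading reddit_commments.csv with csv.DictReader, and entries by loading one of the JSON files and iterating over its "data" list. Since the published CSV is already filtered, keep_comment would mainly be useful when applying the same criteria to newly collected comments.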

References

Reif, Emily et al. (2022). A Recipe For Arbitrary Text Style Transfer with Large Language Models.

Keywords

Text Style Transfer, NLG, NLP
