research data . Dataset . 2017

Webis-TLDR-17 Corpus

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno
Open Access English
  • Published: 07 Nov 2017
  • Publisher: Zenodo
Abstract
This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON file where each line is a JSON object representing a post. The schema of each post is shown below:
  • author: string (nullable = true)
  • body: string (nullable = true)
  • normalizedBody: string (nullable = true)
  • content: string (nullable = true)
  • content_len: long (nullable = true)
  • summary: string (nullable = true)
  • summary_len: long (nullable = true)
  • id: string (nullable = true)
  • subreddit: string (nullable = true)
  • subreddit_id: stri...
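Because the abstract describes a one-JSON-object-per-line layout with nullable fields, reading the corpus might look roughly like the sketch below. It is not part of the dataset's own tooling: the helper name iter_pairs and the two sample records are invented for illustration, and only the field names content and summary come from the schema above.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (content, summary) pairs from JSON-lines text.

    `lines` can be any iterable of strings, for example an open file object.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        post = json.loads(line)
        content = post.get("content")
        summary = post.get("summary")
        # The schema marks every field as nullable, so skip incomplete posts.
        if content is None or summary is None:
            continue
        yield content, summary


# Invented sample records (not taken from the corpus), one JSON object per line.
sample = [
    '{"author": "a", "content": "long post text", "summary": "short", "id": "1"}',
    '{"author": "b", "content": null, "summary": null, "id": "2"}',
]

pairs = list(iter_pairs(sample))
print(pairs)
```

For the downloaded corpus, the same function could be given an open file object instead of the sample list, which streams the posts one line at a time rather than loading the whole file into memory.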
Subjects
free text keywords: tl;dr, Abstractive Summarization, Social Media Dataset, Biochemistry, Cell Biology, Genetics, Sociology, Marine Biology, Science Policy, 60506 Virology, 69999 Biological Sciences not elsewhere classified, 80699 Information Systems not elsewhere classified
Communities
Science and Innovation Policy Studies
Download from
Zenodo
Dataset . 2017
Provider: Zenodo
Zenodo
Dataset . 2017
Provider: Datacite
Zenodo
Dataset . 2017
Provider: Datacite
figshare
Dataset . 2017
Provider: figshare