publication . Conference object . Article . 2018

Text Simplification from Professionally Produced Corpora

Scarton, Carolina; Paetzold, G. H.; Specia, L.
Open Access
  • Published: 07 May 2018
Abstract
The lack of large and reliable datasets has hindered progress in Text Simplification (TS). We investigate the application of the recently created Newsela corpus, the largest collection of professionally written simplifications available, to TS tasks. Using new alignment algorithms, we extract 550,644 complex-simple sentence pairs from the corpus. This data is explored in different ways: (i) we show that traditional readability metrics capture the different complexity levels in this corpus surprisingly well, (ii) we build machine learning models to classify sentences into complex vs. simple and to predict complexity levels that outperform their respective b...
Subjects
ACM Computing Classification System: Information Systems, Information Storage and Retrieval
Funded by
EC| SIMPATICO
Project
SIMPATICO
SIMplifying the interaction with Public Administration Through Information technology for Citizens and cOmpanies
  • Funder: European Commission (EC)
  • Project Code: 692819
  • Funding stream: H2020 | RIA
Validated by funder
Download from
Open Access
Zenodo
Conference object . 2018
Provider: Datacite