ZENODO
Dataset . 2026
License: CC BY
Data sources: ZENODO
Data sources: Datacite

DOREMI Denoised Distantly Supervised Datasets

Authors: Menotti, Laura; Marchesin, Stefano; Silvello, Gianmaria

Abstract

This repository contains the denoised distantly supervised datasets produced with the DOREMI framework. DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) is an active learning-based system that enhances the training data through targeted manual annotation of highly informative examples. DOREMI operates upstream in the general DocRE pipeline by augmenting the dataset in a model-agnostic fashion, enabling any downstream DocRE model to benefit from improved long-tail coverage. This approach produces a Denoised Distantly Supervised Dataset (DDS) that can be used to train any existing DocRE model, with demonstrated improvements in long-tail relation predictions.

We release four DDSs, which were used for the experimental evaluation of DOREMI. Two datasets (denoted by "DOREMI") were generated by cleaning the DocRED distant dataset with the DOREMI framework, utilizing DocRED and Re-DocRED respectively. The other two datasets (denoted by "DU") were generated by a hybrid approach that combines DOREMI long-tail predictions with UGDRE annotations for frequent relations.

File Outline

All datasets are denoised versions of the DocRED distant dataset. Hence, they all contain the same documents and entities. The DocRED distant dataset consists of 101,873 documents and 1,965,484 entities. The repository contains the following files:

  • DOREMI-DDS-DocRED.json: DDS generated by DOREMI utilizing DocRED. The dataset consists of 1,704,161 positive examples.
  • DOREMI-DDS-ReDocRED.json: DDS generated by DOREMI utilizing Re-DocRED. The dataset consists of 3,957,238 positive examples.
  • DU-DDS-DocRED.json: DDS generated by combining DOREMI (utilizing DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,726,707 positive examples.
  • DU-DDS-ReDocRED.json: DDS generated by combining DOREMI (utilizing Re-DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,796,229 positive examples.
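The counts reported for each file can be checked with a short script. The following is a minimal sketch, not part of the release: it assumes that each file is a single JSON array of documents in the DocRED layout (a "vertexSet" list with one entry per entity and a "labels" list of {"h", "t", "r"} triples) and that one label corresponds to one positive example. Both assumptions should be verified against the actual files. The helper names summarize_dds and load_dds and the toy document are hypothetical.

```python
import json
from collections import Counter


def summarize_dds(docs):
    """Count documents, entities, and positive examples in DocRED-style docs."""
    relation_counts = Counter()
    n_entities = 0
    for doc in docs:
        # Assumption: "vertexSet" holds one list of mentions per entity.
        n_entities += len(doc.get("vertexSet", []))
        # Assumption: each label is a dict with head "h", tail "t", relation "r".
        for label in doc.get("labels", []):
            relation_counts[label["r"]] += 1
    return {
        "documents": len(docs),
        "entities": n_entities,
        "positive_examples": sum(relation_counts.values()),
        "relation_counts": relation_counts,
    }


def load_dds(path):
    """Load a DDS file, assuming it is one JSON array of documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Tiny hand-made document for illustration only (not taken from the datasets).
toy_docs = [
    {
        "title": "Toy",
        "sents": [["Padua", "is", "in", "Italy", "."]],
        "vertexSet": [
            [{"name": "Padua", "sent_id": 0, "pos": [0, 1], "type": "LOC"}],
            [{"name": "Italy", "sent_id": 0, "pos": [3, 4], "type": "LOC"}],
        ],
        "labels": [{"h": 0, "t": 1, "r": "P17", "evidence": []}],
    }
]
summary = summarize_dds(toy_docs)
print(summary["documents"], summary["entities"], summary["positive_examples"])
```

On a downloaded file, summarize_dds(load_dds("DOREMI-DDS-DocRED.json")) would give the corresponding totals under the same assumptions, and the "relation_counts" entry shows how the positive examples are distributed across frequent and long-tail relations.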
Reproducibility Guidelines

This section describes how to obtain the results presented in the "Experimental Results" section of the paper.

Table 4 (and Tables A8 and B10 (a) of the Technical Appendix)

  • Rows with DOREMI as Distant Data: results were obtained by training the DocRE models with DOREMI-DDS-DocRED.json.
  • Rows with D+U as Distant Data: results were obtained by training the DocRE models with DU-DDS-DocRED.json.

Tables 5 and 6 (and Tables A8 and B10 (b) of the Technical Appendix)

  • Rows with DOREMI as Distant Data: results were obtained by training the DocRE models with DOREMI-DDS-ReDocRED.json.
  • Rows with D+U as Distant Data: results were obtained by training the DocRE models with DU-DDS-ReDocRED.json.
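For scripted experiments, the mapping between result tables and training files can be kept in a small lookup. This is only an illustrative sketch: the keys "table_4" and "tables_5_6" and the function name training_file are hypothetical, while the file names and the "DOREMI" and "D+U" labels are those given in the guidelines.

```python
# Maps (table group, Distant Data label) to the DDS file used for training.
# Key names are hypothetical; file names come from the guidelines.
TRAINING_FILES = {
    ("table_4", "DOREMI"): "DOREMI-DDS-DocRED.json",
    ("table_4", "D+U"): "DU-DDS-DocRED.json",
    ("tables_5_6", "DOREMI"): "DOREMI-DDS-ReDocRED.json",
    ("tables_5_6", "D+U"): "DU-DDS-ReDocRED.json",
}


def training_file(table_group, distant_data):
    """Return the DDS file name for a table group and Distant Data label."""
    try:
        return TRAINING_FILES[(table_group, distant_data)]
    except KeyError:
        raise ValueError(
            f"No DDS file listed for {table_group!r} with {distant_data!r}"
        ) from None


print(training_file("table_4", "D+U"))
```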
