
This repository contains the denoised distantly supervised datasets produced using the DOREMI framework. DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) is an active learning-based system that enhances the training data through targeted manual annotation of highly informative examples. DOREMI operates upstream in the general DocRE pipeline, augmenting the dataset in a model-agnostic fashion so that any downstream DocRE model can benefit from improved long-tail coverage. The result is a Denoised Distantly Supervised Dataset (DDS) that can be used to train any existing DocRE model, with demonstrated improvements in long-tail relation predictions.

We release four DDSs, which were used for the experimental evaluation of DOREMI. Two datasets (denoted by "DOREMI") were generated by cleaning the DocRED distant dataset with the DOREMI framework, utilizing DocRED and Re-DocRED respectively. The other two datasets were generated by a hybrid approach (denoted by "DU"), combining DOREMI long-tail predictions with UGDRE annotations for frequent relations.

File Outline

All datasets are denoised versions of the DocRED distant dataset. Hence, they all contain the same documents and entities. The DocRED distant dataset consists of 101,873 documents and 1,965,484 entities.

The repository contains the following files:

DOREMI-DDS-DocRED.json: DDS generated by DOREMI utilizing DocRED. The dataset consists of 1,704,161 positive examples.

DOREMI-DDS-ReDocRED.json: DDS generated by DOREMI utilizing Re-DocRED. The dataset consists of 3,957,238 positive examples.

DU-DDS-DocRED.json: DDS generated by combining DOREMI (utilizing DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,726,707 positive examples.

DU-DDS-ReDocRED.json: DDS generated by combining DOREMI (utilizing Re-DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,796,229 positive examples.

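As a quick sanity check after downloading, the document, entity, and positive-example counts of a DDS file can be recomputed. The sketch below is not part of the release; it assumes that each file keeps the layout of the original DocRED distant dataset, namely a JSON list of documents where "vertexSet" holds one item per entity and "labels" holds one item per positive relation triple. The helper names load_dds and summarize are hypothetical, and the field names should be adjusted if the released files differ.

```python
import json
from pathlib import Path


def load_dds(path):
    """Load a DDS file, assumed to be a JSON list of DocRED-style documents."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(docs):
    """Count documents, entities, and positive examples.

    Assumption: each document has a "vertexSet" list (one item per entity)
    and a "labels" list (one item per positive relation triple), as in DocRED.
    Missing fields are treated as empty.
    """
    return {
        "documents": len(docs),
        "entities": sum(len(d.get("vertexSet", [])) for d in docs),
        "positive_examples": sum(len(d.get("labels", [])) for d in docs),
    }


if __name__ == "__main__":
    # Tiny in-memory document so the sketch runs without the real files.
    toy_docs = [
        {
            "title": "toy document",
            "sents": [["Rome", "is", "in", "Italy", "."]],
            "vertexSet": [
                [{"name": "Rome", "sent_id": 0, "pos": [0, 1], "type": "LOC"}],
                [{"name": "Italy", "sent_id": 0, "pos": [3, 4], "type": "LOC"}],
            ],
            "labels": [{"h": 0, "t": 1, "r": "P17", "evidence": []}],
        }
    ]
    print(summarize(toy_docs))
    # With a downloaded file: print(summarize(load_dds("DOREMI-DDS-DocRED.json")))
```

If the layout assumption holds, the counts for each file should match the figures listed in the File Outline. The files are large, so loading one fully into memory may take a few gigabytes.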
Reproducibility Guidelines

This section describes how to obtain the results presented in the "Experimental Results" section of the paper.

Table 4 (and Tables A8 and B10 (a) of the Technical Appendix)

The results in rows with DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-DocRED.json. The results in rows with D+U as Distant Data were obtained by training the DocRE models with DU-DDS-DocRED.json.

Tables 5 and 6 (and Tables A8 and B10 (b) of the Technical Appendix)

The results in rows with DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-ReDocRED.json. The results in rows with D+U as Distant Data were obtained by training the DocRE models with DU-DDS-ReDocRED.json.
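
For scripted experiments, the table-to-file correspondence described in the Reproducibility Guidelines can be captured in a small lookup. This is only an illustrative sketch: the dictionary DDS_FILES and the helper dds_file_for are hypothetical names, and the keys simply reuse the table numbers and "Distant Data" row labels mentioned in this description.

```python
# Mapping from (paper table, "Distant Data" row label) to the DDS file
# used for training, as stated in the Reproducibility Guidelines.
DDS_FILES = {
    ("Table 4", "DOREMI"): "DOREMI-DDS-DocRED.json",
    ("Table 4", "D+U"): "DU-DDS-DocRED.json",
    ("Tables 5 and 6", "DOREMI"): "DOREMI-DDS-ReDocRED.json",
    ("Tables 5 and 6", "D+U"): "DU-DDS-ReDocRED.json",
}


def dds_file_for(table, distant_data):
    """Return the DDS file name for a given table and Distant Data label."""
    try:
        return DDS_FILES[(table, distant_data)]
    except KeyError:
        known = ", ".join(f"{t} / {d}" for t, d in DDS_FILES)
        raise ValueError(
            f"No DDS file for ({table!r}, {distant_data!r}); known settings: {known}"
        ) from None


if __name__ == "__main__":
    print(dds_file_for("Table 4", "D+U"))
```

The same files apply to the corresponding appendix tables: A8 and B10 (a) follow the Table 4 entries, and A8 and B10 (b) follow the entries for Tables 5 and 6.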
