DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations

Publication: Article, Other literature type. 01 Jan 2024. Belgium. English.

Publisher: Oxford University Press (OUP)

Journal: Database, volume 2024 (eissn: 1758-0463)

Authors: Charlotte Nachtegael; Jacopo De Stefani; Anthony Cnudde; Tom Lenaerts

doi: 10.1093/database/baae039

pmid: 38805753

pmc: PMC11131422

handle: 20.500.14017/a14918e6-936a-4be8-a600-21b5b9549ca4 , 2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/374632

Abstract

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by finding the most informative subset of samples to label. A set of 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) was pre-annotated with PubTator, and text fragments featuring potential digenic variant combinations, i.e. gene–variant–gene–variant, were extracted. These fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene–variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
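The idea of letting active learning pick "the most informative subset of samples to label" can be illustrated with a minimal sketch. The abstract does not say which query strategy ALAMBIC applied, so least-confidence sampling on a binary classifier is assumed here purely for illustration; the function name `least_confident` and the scores are hypothetical.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[float], batch_size: int) -> List[int]:
    """Return the indices of the unlabelled fragments whose predicted
    probability of being positive lies closest to 0.5.

    Assumption: a binary classifier scores each fragment, and the
    fragments it is least sure about are sent to the annotator first.
    """
    # Distance from 0.5 is small when the model is uncertain.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical model scores for six unlabelled text fragments.
scores = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]
to_annotate = least_confident(scores, batch_size=2)
print(to_annotate)  # [1, 3]
```

In a full loop, the selected fragments would be labelled by an expert, added to the training set, and the model retrained before the next selection round.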
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571
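A 4-ary candidate of the form gene–variant–gene–variant can be pictured as a pair of gene–variant mentions taken from the same text fragment. The sketch below is not the authors' extraction pipeline: it assumes each variant mention has already been attributed to a gene, and it assumes that only pairs from two different genes count as digenic candidates. The class `GeneVariant`, the function `digenic_candidates`, and the mention names are hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class GeneVariant:
    gene: str     # gene mention (placeholder name)
    variant: str  # variant mention attributed to that gene


def digenic_candidates(
    pairs: List[GeneVariant],
) -> List[Tuple[GeneVariant, GeneVariant]]:
    """Enumerate unordered (gene1, variant1, gene2, variant2) candidates.

    Assumption: a candidate needs variants in two different genes,
    so pairs within the same gene are skipped.
    """
    return [(a, b) for a, b in combinations(pairs, 2) if a.gene != b.gene]


# Hypothetical mentions found in one text fragment.
mentions = [
    GeneVariant("GENE_A", "c.100A>G"),
    GeneVariant("GENE_A", "c.200C>T"),
    GeneVariant("GENE_B", "p.Arg50Trp"),
]
candidates = digenic_candidates(mentions)
print(len(candidates))  # 2
```

Because the number of pairs grows quadratically with the mentions in a fragment, most candidates in a setting like this are likely to be negatives, which is consistent with the 794 positive instances among 8442 fragments reported above.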

Country

Belgium

Related Organizations

Vrije Universiteit Brussel
Belgium
Université Libre de Bruxelles
Belgium
Institut Barcelona d'Estudis Internacionals
Spain

Keywords

Data Curation/methods, General computer science, biomedical relation extraction dataset, Artificial intelligence, Data Mining/methods, genetic diseases, active learning, Databases, Genetic, Humans, Data Mining, Original Article, Supervised Machine Learning, Data Curation

Related research

DUVEL software on GitHub (relation: IsRelatedTo)

Impact by BIP!

selected citations: 1
popularity: Average
influence: Average
impulse: Average