
Bioinformatics workflows are essential for complex biological data analyses and are often described in scientific articles with source code in public repositories. Extracting detailed workflow information from articles can improve accessibility and reusability but is hindered by limited annotated corpora. To address this, we framed the problem as a low-resource extraction task and tested four strategies: 1) creating a tailored annotated corpus, 2) few-shot named-entity recognition (NER) with an autoregressive language model, 3) NER using masked language models with existing and new corpora, and 4) integrating workflow knowledge into NER models. Using BioToFlow, a new corpus of 52 articles annotated with 16 entities, a SciBERT-based NER model achieved a 70.4 F-measure, comparable to inter-annotator agreement. While knowledge integration improved performance for specific entities, it was less effective across the entire information schema. Our results demonstrate that high-performance information extraction for bioinformatics workflows is achievable.
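As a rough illustration of the SciBERT-based NER approach summarized above, the sketch below shows how such a token-classification model might be loaded and applied with the Hugging Face transformers library. The entity labels, model checkpoint choice, and example sentence are placeholders for illustration only; they are not taken from the paper or the BioToFlow corpus, and the classification head shown here is untrained, so the model would still need fine-tuning on an annotated corpus before its predictions are meaningful.

```python
# Hypothetical sketch: token classification (NER) with a SciBERT checkpoint.
# Labels below are an illustrative subset, not the 16 BioToFlow entity types.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Tool", "I-Tool", "B-Data", "I-Data"]  # placeholder label set

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",  # head is randomly initialized until fine-tuned
    num_labels=len(labels),
)

sentence = "Reads were aligned with BWA and variants were called with GATK."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, pred_ids):
    print(f"{token}\t{labels[pred_id]}")
```

In practice, fine-tuning such a model on BIO-tagged sentences (for example via the transformers Trainer with a token-classification data collator) is the standard recipe this sketch assumes.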
Keywords: Bioinformatics Workflows, Information Extraction, Natural Language Processing

Subjects: Computer Science - Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Bioinformatics (q-bio.QM); FOS: Computer and information sciences
| Indicator | Description | Value |
|---|---|---|
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
