Phenotypes Extraction from Text: Analysis and Perspective in the LLM Era

Keywords: LLM, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], phenotype, [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, phenoBERT, [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], genetic, entity linking, embeddings

Baddour, Moussa; Paquelet, Stéphane; Rollier, Paul; Tayrac, Marie; Dameron, Olivier; Labbé, Thomas


Data sources

- INRIA2: Conference object, 2024
- HAL-Rennes 1: Conference object, 2024
- INRIA a CCSD electronic archive server: Conference object, 2024
- Crossref: https://doi.org/10.1109/is6175... (Article, 2024, peer-reviewed; License: STM Policy #29)


Publication type: Article, Conference object
Date: 29 Aug 2024
Publisher: IEEE
Journal: 2024 IEEE 12th International Conference on Intelligent Systems (IS)


doi: 10.1109/is61756.2024.10705235


Abstract

Collecting the relevant list of patient phenotypes, known as deep phenotyping, can significantly improve the final diagnosis. As textual clinical reports are the richest source of phenotype information, automatically extracting phenotypes from them is a critical task. The main challenges of this Information Extraction (IE) task are to identify precisely the text spans related to a phenotype and to link them unequivocally to referenced entities from a source such as the Human Phenotype Ontology (HPO).

Recently, Language Models (LMs) have been the most successful approach for extracting phenotypes from clinical reports. Solutions such as PhenoBERT, relying on BERT or GPT, have shown promising results when applied to datasets built on the hypothesis that most phenotypes are explicitly mentioned in the text. However, this assumption does not always hold in medical genetics. Hence, although LMs carry powerful semantic abilities, their contribution is not clear compared to the syntactic string-matching steps used within current pipelines.

The goal of this study is to improve phenotype extraction from clinical notes related to genetic diseases. Our contributions are threefold. First, we provide a clear definition of the phenotype extraction task from free text, along with a high-level overview of the functions involved. Second, we conduct an in-depth analysis of PhenoBERT, one of the best existing solutions, to evaluate the proportion of phenotypes predicted with simple string-matching. Third, we demonstrate how incorporating large language models (LLMs) in the span detection step can improve performance, especially with implicit phenotypes. In addition, this experiment revealed that the annotations of the existing dataset are not exhaustive, and that LLMs can identify relevant spans missed by human labelers.
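The two functions named in the abstract, span detection and linking to HPO entities, can be illustrated with a purely syntactic string-matching baseline. The sketch below is not PhenoBERT and not the authors' pipeline; it is a minimal illustration under stated assumptions. The lexicon is a hand-picked toy excerpt of HPO terms with illustrative synonyms, and the names `HPO_LEXICON`, `Mention`, and `extract_phenotypes` are hypothetical.

```python
import re
from typing import List, NamedTuple

# Toy lexicon (assumption): a few HPO identifiers with hand-picked synonyms.
# A real system would load labels and synonyms from the full ontology.
HPO_LEXICON = {
    "HP:0001250": ["seizure", "seizures"],
    "HP:0000252": ["microcephaly", "small head"],
    "HP:0001263": ["global developmental delay"],
}


class Mention(NamedTuple):
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    text: str     # surface form found in the report
    hpo_id: str   # linked HPO identifier


def extract_phenotypes(text: str) -> List[Mention]:
    """Detect spans by exact (case-insensitive) synonym match and link them."""
    mentions = []
    for hpo_id, synonyms in HPO_LEXICON.items():
        for synonym in synonyms:
            # Word boundaries avoid matching inside longer words.
            pattern = r"\b" + re.escape(synonym) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                mentions.append(
                    Mention(match.start(), match.end(), match.group(0), hpo_id)
                )
    # Return mentions in reading order.
    mentions.sort(key=lambda m: m.start)
    return mentions


explicit = "The patient presents with seizures and microcephaly."
implicit = "His head circumference is far below the norm for his age."

for mention in extract_phenotypes(explicit):
    print(mention)
print(extract_phenotypes(implicit))
```

On the explicit sentence, the matcher finds both phenotypes. On the implicit sentence it returns an empty list, because no lexicon synonym appears verbatim even though a clinician would read the sentence as suggesting microcephaly. This is the kind of implicit mention for which the abstract proposes LLM-based span detection.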

Related Organizations

B-com Institute of Research and Technology
France
French National Centre for Scientific Research
France
Centre Hospitalier Universitaire de Rennes
France
Institut de Recherche en Informatique et Systèmes Aléatoires
France
Université de Rennes 1
France



Impact byBIP!

- Selected citations: 0
- Popularity: Average
- Influence: Average
- Impulse: Average


Related to Research communities

INRIA