
Collecting the relevant list of patient phenotypes, known as deep phenotyping, can significantly improve the final diagnosis. As textual clinical reports are the richest source of phenotype information, extracting phenotypes from them automatically is a critical task. The main challenges of this Information Extraction (IE) task are to identify precisely the text spans related to a phenotype and to link them unequivocally to reference entities from a source such as the Human Phenotype Ontology (HPO). Recently, Language Models (LMs) have been the most successful approach for extracting phenotypes from clinical reports. Solutions such as PhenoBERT, relying on BERT or GPT, have shown promising results when applied to datasets built on the hypothesis that most phenotypes are explicitly mentioned in the text. However, this assumption does not always hold in medical genetics. Hence, although LMs carry powerful semantic abilities, their contribution is not clear compared to the syntactic string-matching steps used within current pipelines. The goal of this study is to improve phenotype extraction from clinical notes related to genetic diseases. Our contributions are threefold. First, we provide a clear definition of the phenotype extraction task from free text, along with a high-level overview of the functions involved. Second, we conduct an in-depth analysis of PhenoBERT, one of the best existing solutions, to evaluate the proportion of phenotypes predicted with simple string-matching. Third, we demonstrate how incorporating large language models (LLMs) into the span detection step can improve performance, especially with implicit phenotypes. In addition, this experiment revealed that the annotations of existing datasets are not exhaustive and that LLMs can identify relevant spans missed by human labelers.
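To make the distinction between explicit and implicit phenotypes concrete, the following is a minimal sketch of a purely syntactic string-matching step: it detects spans by exact lexicon lookup and links each one to an HPO identifier. It is not PhenoBERT's implementation; the function name `match_phenotypes` and the four-entry lexicon are assumptions chosen for illustration, standing in for the full HPO label and synonym list.

```python
import re

# Hypothetical mini-lexicon (assumption): a few HPO labels and one plural
# form, mapped to their HPO identifiers. A real pipeline would load the
# full ontology with all synonyms.
HPO_LEXICON = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "microcephaly": "HP:0000252",
    "global developmental delay": "HP:0001263",
}


def match_phenotypes(text, lexicon=HPO_LEXICON):
    """Return (start, end, span_text, hpo_id) for every exact lexicon hit."""
    hits = []
    for label, hpo_id in lexicon.items():
        # Word boundaries prevent "seizure" from matching inside "seizures".
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), hpo_id))
    # Sort by character offset so spans come back in reading order.
    return sorted(hits)


explicit = match_phenotypes("The patient has microcephaly and recurrent seizures.")
implicit = match_phenotypes("Head circumference is well below the 3rd percentile.")
```

In this sketch the first sentence yields two linked spans, while the second, which describes microcephaly without naming it, yields none. That gap is the kind of implicit mention for which a semantic span detection step would be needed.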
LLM, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], phenotype, [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, phenoBERT, [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG], genetic, entity linking, embeddings
