
Annotation of research data is a key element of Open Science and has gained additional value as training input for artificial intelligence. However, developing metadata schemas poses a series of challenges, including schema optimisation and securing both complete coverage and consistent completeness and quality of annotations. We employ large language models (LLMs) to address some of these challenges while keeping researchers in the loop to ensure the reliability of annotations.

Our research data management group currently supports seven biomedical research consortia. We develop customised metadata schemas together with consortium members, drawing on established controlled vocabularies (Engel et al. 2025). Schemas are implemented on the fredato research data platform developed at the IMBI (Watter et al. 2023). They are documented and published as knowledge graphs adhering to the Resource Description Framework (RDF), relating metadata to research processes as modelled by commonly used ontologies.

LLMs are employed to develop initial schema drafts from related research literature and to predict dataset annotations from scientific papers (Giuliani et al. 2025). The models have performed well on these tasks, helping researchers improve metadata coverage in their consortia.
Metadata annotation prediction, Metadata, Large Language Models, Research data management, Data annotation, Data schemas
