ZENODO
Other literature type . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Conference object . 2025
License: CC BY
Data sources: Datacite

From Idea to Prototype: Using BITS and LLMs to automate the annotation process for SGN Collection Data

Authors: Wolodkin, Alexander; Martens, Claudia;

Abstract

We will present a workflow at SGN that combines the outcome of BITS (https://projects.tib.eu/bits/home), i.e. the ESS collection of the TIB TS, with GPT4All in order to identify gaps in terminologies on the one hand and to assist scientists who are working on new collections on the other. The work is motivated by two major data management challenges facing SGN: Legacy Data Digitisation (historically grown data require systematic transformation into machine-readable formats) and Data Proliferation Management (continuous input of data generated by ongoing collection efforts and research activities). Our prototyping process can be divided into several areas, including identifying nominal phrases (NPs) in the collection data and annotating them using the BITS TS. Our primary goal was to achieve reliable detection, with a focus on minimising false negatives while accepting some false positives during annotation. During the prototyping phase, we encountered several obstacles: poor NP detection quality in scientific texts, and unreliable conjunction splitting and singularization with common tools. It is also not always possible to determine the correct language of a text, especially with mixed-language content. Revising our requirements led us to choose GPT4All as our preferred solution, specifically with the Meta-Llama-3-8B-Instruct.Q4_0.gguf model. This allows us to perform high-quality NP detection and transformation, but at very high computational and time cost. To optimise resource utilisation, GPT4All is employed only for high-level operations; other operations can be performed by tools with lower hardware requirements. Statistical logging allows us to collect significant information about NP detection and usage, which we can reuse in later development steps. By leveraging the strengths of BITS and GPT4All, SGN is paving the way for more accurate processing of complex scientific data to improve research outcomes.
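
The division of labour described in the abstract (an expensive model for NP detection, lightweight tools for the remaining steps, and statistical logging along the way) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the in-memory `terminology` dictionary, the `EX:0001` identifier, and the stub extractor are all hypothetical. In a real setup the extractor would wrap a call to the local model, and the lookup would query the terminology service.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def normalise(phrase: str) -> str:
    """Cheap step that needs no LLM: lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def annotate_record(
    text: str,
    extract_nps: Callable[[str], List[str]],
    terminology: Dict[str, str],
    stats: Counter,
) -> Dict[str, Optional[str]]:
    """Map each detected NP to a term identifier, or None if no term matches.

    `extract_nps` is the expensive, high-level step (assumed to be backed by
    a local LLM); normalisation and lookup are the lightweight steps.
    """
    annotations: Dict[str, Optional[str]] = {}
    for phrase in extract_nps(text):
        stats["np_detected"] += 1
        key = normalise(phrase)
        term_id = terminology.get(key)
        if term_id is None:
            # Unmatched NPs are candidates for gaps in the terminology.
            stats["np_unmatched"] += 1
            stats[f"gap:{key}"] += 1
        else:
            stats["np_matched"] += 1
        annotations[key] = term_id
    return annotations


def stub_extractor(text: str) -> List[str]:
    """Hypothetical stand-in for the LLM call; returns a fixed answer."""
    return ["Fossil Beetle", "amber  inclusion"]


# Hypothetical terminology with a made-up identifier.
terminology = {"fossil beetle": "EX:0001"}
stats: Counter = Counter()
result = annotate_record(
    "Fossil beetle preserved as an amber inclusion.",
    stub_extractor,
    terminology,
    stats,
)
print(result)
print(dict(stats))
```

Passing the extractor in as a callable keeps the costly model call isolated, so it can be swapped, cached, or batched without touching the inexpensive normalisation and lookup steps. The `gap:` counters accumulate the unmatched phrases that may point to missing terms.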
