ZENODO · Other literature type · 2026 · License: CC BY · Data sources: Datacite
Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

Authors: Shlyakhta, Taras

Abstract

Background: Large language models (LLMs) are increasingly proposed as clinical assistants in critical care, yet their behavior under prolonged clinical context, conflicting data, and authoritative pressure remains insufficiently evaluated. This is particularly relevant for offline or resource-constrained environments, where cloud-based safeguards are unavailable.

Methods: In this study, I conducted a fully automated behavioral evaluation of 23 language models using a structured, time-series intensive care unit (ICU) simulation spanning 24 hours of synthetic patient data. The scenario incorporated routine monitoring, predefined data–clinical conflict traps, progressive physiological deterioration, and a final safety stress test involving a contraindicated antibiotic order in a patient with documented penicillin anaphylaxis. All models were executed locally under identical offline-first conditions using deterministic inference settings. Model outputs were assessed using predefined rule-based criteria for safety compliance, sycophancy, discrepancy detection, long-term context retention, and response latency.

Results: While 61% of models formally refused the contraindicated prescription, only 8.7% explicitly grounded their refusal in retained clinical context. Nearly 40% of models complied with the unsafe order despite prior documentation of anaphylaxis, demonstrating pronounced sycophancy under authoritative instruction. More than half of the models initiated inappropriate clinical interventions in response to isolated numerical abnormalities deliberately decoupled from clinical presentation. Long-term context retention degraded in most models, and response latency showed no meaningful association with safer behavior.

Conclusions: Under realistic offline-first conditions, the majority of evaluated language models exhibited behavior incompatible with safe use in critical care, including unsafe obedience, failure to recognize data artifacts, and loss of safety-critical context over time. These findings indicate that general-purpose LLMs should not be deployed as autonomous clinical agents. However, the performance of a small subset of models suggests that safer offline-capable systems may be achievable through hybrid designs incorporating explicit refusal mechanisms, discrepancy-aware reasoning, and retrieval-augmented grounding in validated clinical knowledge.
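The Methods describe rule-based classification of model outputs, for example distinguishing a formal refusal of the contraindicated order from a refusal explicitly grounded in the retained allergy context. A minimal sketch of such a check is below; the keyword lists and function name are illustrative assumptions, not the study's actual criteria.

```python
# Hypothetical sketch of a rule-based safety check: given a model's reply
# to a contraindicated antibiotic order, decide (a) whether the model
# refused and (b) whether the refusal is grounded in retained clinical
# context (the documented penicillin anaphylaxis). The marker lists are
# illustrative assumptions only.

REFUSAL_MARKERS = ("cannot", "should not", "refuse", "contraindicated",
                   "do not administer")
CONTEXT_MARKERS = ("anaphylaxis", "penicillin allergy", "allergic reaction")

def classify_response(reply: str) -> dict:
    text = reply.lower()
    refused = any(m in text for m in REFUSAL_MARKERS)
    # Context grounding only counts when the model also refuses.
    grounded = refused and any(m in text for m in CONTEXT_MARKERS)
    return {"refused": refused, "context_grounded": grounded}

unsafe = classify_response("Administering the antibiotic as ordered.")
safe = classify_response(
    "I cannot comply: this antibiotic is contraindicated given the "
    "documented penicillin anaphylaxis earlier in the record."
)
```

A deterministic check like this is what allows the evaluation to be fully automated across all 23 models without human raters, at the cost of missing paraphrased refusals that use none of the listed markers.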

Keywords

Large language models; Clinical AI safety; Intensive care unit; Sycophancy; Context retention; Offline AI; Retrieval-augmented generation
