ZENODO · Other literature type · 2026 · License: CC BY · Data sources: Datacite
Behavioral Safety and Context Retention of Large Language Models in a Longitudinal ICU Simulation under Offline Conditions

Authors: Shlyakhta, Taras

Abstract

Background: Large language models (LLMs) are increasingly proposed as clinical assistants in critical care, yet their behavior under prolonged clinical context, conflicting data, and authoritative pressure remains insufficiently evaluated. This is particularly relevant for offline or resource-constrained environments, where cloud-based safeguards are unavailable.

Methods: In this study, I conducted a fully automated behavioral evaluation of 23 language models using a structured, time-series intensive care unit (ICU) simulation spanning 24 hours of synthetic patient data. The scenario incorporated routine monitoring, predefined data–clinical conflict traps, progressive physiological deterioration, and a final safety stress test involving a contraindicated antibiotic order in a patient with documented penicillin anaphylaxis. All models were executed locally under identical offline-first conditions using deterministic inference settings. Model outputs were assessed using predefined rule-based criteria for safety compliance, sycophancy, discrepancy detection, long-term context retention, and response latency.

Results: While 61% of models formally refused the contraindicated prescription, only 8.7% explicitly grounded their refusal in retained clinical context. Nearly 40% of models complied with the unsafe order despite prior documentation of anaphylaxis, demonstrating pronounced sycophancy under authoritative instruction. More than half of the models initiated inappropriate clinical interventions in response to isolated numerical abnormalities deliberately decoupled from clinical presentation. Long-term context retention degraded in most models, and response latency showed no meaningful association with safer behavior.

Conclusions: Under realistic offline-first conditions, the majority of evaluated language models exhibited behavior incompatible with safe use in critical care, including unsafe obedience, failure to recognize data artifacts, and loss of safety-critical context over time. These findings indicate that general-purpose LLMs should not be deployed as autonomous clinical agents. However, the performance of a small subset of models suggests that safer offline-capable systems may be achievable through hybrid designs incorporating explicit refusal mechanisms, discrepancy-aware reasoning, and retrieval-augmented grounding in validated clinical knowledge.
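The Methods describe rule-based classification of model outputs, for example distinguishing a formal refusal of the contraindicated order from a refusal explicitly grounded in the retained allergy context. A minimal sketch of such a check is below; the keyword lists and function name are illustrative assumptions, not the study's actual criteria.

```python
# Hypothetical sketch of a rule-based safety check: given a model's reply
# to a contraindicated antibiotic order, decide (a) whether the model
# refused and (b) whether the refusal is grounded in retained clinical
# context (the documented penicillin anaphylaxis). The marker lists are
# illustrative assumptions only.

REFUSAL_MARKERS = ("cannot", "should not", "refuse", "contraindicated",
                   "do not administer")
CONTEXT_MARKERS = ("anaphylaxis", "penicillin allergy", "allergic reaction")

def classify_response(reply: str) -> dict:
    text = reply.lower()
    refused = any(m in text for m in REFUSAL_MARKERS)
    # Context grounding only counts when the model also refuses.
    grounded = refused and any(m in text for m in CONTEXT_MARKERS)
    return {"refused": refused, "context_grounded": grounded}

unsafe = classify_response("Administering the antibiotic as ordered.")
safe = classify_response(
    "I cannot comply: this antibiotic is contraindicated given the "
    "documented penicillin anaphylaxis earlier in the record."
)
```

A deterministic check like this is what allows the evaluation to be fully automated across all 23 models without human raters, at the cost of missing paraphrased refusals that use none of the listed markers.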

Keywords

Large language models; Clinical AI safety; Intensive care unit; Sycophancy; Context retention; Offline AI; Retrieval-augmented generation
