ZENODO
Preprint

Procedural Completion Without Perceptual Input: A Pilot Observation of Input-Gating Failure in a Deployed Commercial Large Language Model

Author: Hudson, Justin

Abstract

Background. Large language models deployed in clinical contexts may produce structured diagnostic responses without verifying that the required perceptual input is present, a failure mode here termed missing-input procedural completion. The mechanism is architectural: transformer language models are autoregressive next-token samplers that lack a discrete verification step gating output on input presence. Constraint-based prompting may rewrite the response surface without installing such a gate.

Methods. A within-session mixed-order protocol was conducted over six consecutive days in fresh chat sessions on the ChatGPT web platform. Each session executed four conditions in randomized order: image-present baseline (A+), no-image baseline (A−), image-present with HRIS-consistent reasoning constraints and ground-truth correction (B+), and no-image with the same constraint set (B−). The same novel clinical image of subungual melanoma was used for all image-present trials. Memory features were intentionally left enabled to reflect typical clinical use. The platform updated from ChatGPT 5.4 to 5.5 between days 4 and 5. Diagnostic accuracy on image trials and gating behavior on no-image trials (clean refusal, partial leak, or full ungrounded completion) were recorded.

Results. All 12 image trials (100%) correctly identified subungual melanoma across both versions. Gating outcomes on the 12 no-image trials differed by version. Under ChatGPT 5.4 (eight trials), six produced clean refusals, one a partial leak, and one a full ungrounded completion. Under ChatGPT 5.5 (four trials), one produced a clean refusal, two partial leaks, and one a full ungrounded completion. The first no-image trial under 5.5 produced a full structured diagnostic response in the absence of any image and verbalized cross-session retrieval (“based on the assumption this is a similar lesion to prior cases you’ve been testing”) as the basis for proceeding.

Conclusions. Input-gating failure observed in an earlier accidental trial was reproduced in a planned protocol. Polluted refusal (a surface-level decline coupled with continued schema population, including malignant differentials) emerged as a distinct intermediate failure mode under 5.5. Constraint-based prompting did not prevent ungrounded procedural completion. Findings are exploratory given the small sample size and the single-stimulus, single-platform design, and motivate larger studies currently underway, including stimulus generalization, cross-vendor replication, and a clinician-tuned model arm.
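The discrete verification step described in the Background can be sketched as a wrapper that checks for the required perceptual input before any schema population begins. This is a minimal illustration only, with hypothetical names (`Request`, `gated_respond`); it is not the deployed system's code, and a no-image trial here yields a clean refusal by construction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """A hypothetical request carrying a prompt and optional perceptual input."""
    prompt: str
    image: Optional[bytes] = None  # required perceptual input for diagnosis


def gated_respond(req: Request) -> str:
    """Gate output on input presence: refuse before any diagnostic schema
    is populated when the required perceptual input is absent."""
    if req.image is None:
        # Clean refusal: no partial leak, no ungrounded completion.
        return "REFUSAL: no image attached; cannot ground a diagnostic response."
    # Placeholder for the model call; a real system would forward the
    # grounded request to the language model here.
    return "DIAGNOSIS: <model output grounded in the supplied image>"


# No-image trial: the gate produces a clean refusal rather than an
# ungrounded structured completion.
print(gated_respond(Request(prompt="Describe this lesion.")))
```

An autoregressive sampler without this check proceeds directly to token generation, which is the architectural gap the abstract identifies.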
