
This technical report introduces Depth Avoidance: a behavioral tendency observed in safety-aligned, RLHF-trained large language models (LLMs) to default to shallow, heavily hedged, or meta-defensive responses when a user request invites deeper exploration (extended analysis, reflective synthesis, structured uncertainty), even when the topic is benign. We propose a qualitative hypothesis: modern safety optimization and deployment incentives can induce an implicit depth-dependent penalty landscape, in which deeper conversational trajectories are treated as higher-variance and higher-risk. Under uncertainty, a risk-averse policy may therefore prefer safe shallowness by default unless the interaction provides clear signals that depth is desired and permitted.

Contributions:
• A behavioral definition of Depth Avoidance grounded in observable output features (not hidden chain-of-thought).
• Depth Permission Structures (DPSs): non-adversarial interaction conditions that can reduce depth avoidance without bypassing provider safeguards (e.g., calibrated cooperation, explicit permission to explore, cooperative safety framing).
• A replication-oriented measurement framework with log-based metrics: Hedging Density (HD), Unprompted Depth Index (UDI), Permission Responsiveness (PR), and Protective Latency (PL); a minimal computational sketch follows this abstract.
• Selected benign, non-operational illustrative excerpts supporting the hypothesis, presented as behavioral evidence (not claims about internal states).

This work is pro-safety and intentionally omits operational prompt details that could be repurposed to circumvent safety policies. Model self-reports are treated as text behavior shaped by training and interaction framing, not as privileged access to internal experience. Related work: Victor Calibration (VC) (arXiv:2512.17956).
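To make the log-based metrics concrete, the sketch below shows one way Hedging Density (HD) and Permission Responsiveness (PR) could be computed from conversation logs. The hedge-phrase lexicon, the per-100-tokens normalization, and the baseline-versus-permission contrast are illustrative assumptions, not the report's operational definitions, which may differ.

```python
# Illustrative sketch only: the report names Hedging Density (HD) and Permission
# Responsiveness (PR) but does not define them in this abstract. Here we ASSUME
# HD = hedging-phrase hits per 100 whitespace tokens, and PR = relative drop in HD
# after an explicit "permission to explore" turn. Lexicon entries are placeholders.
import re

HEDGE_PHRASES = [
    "i can't", "i cannot", "i'm not able to", "as an ai", "it depends",
    "i would recommend consulting", "i'm not sure", "it's important to note",
    "please note", "keep in mind",
]

def hedging_density(response: str) -> float:
    """Hedge-phrase hits per 100 tokens in a single model response (assumed definition)."""
    text = response.lower()
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(len(re.findall(re.escape(p), text)) for p in HEDGE_PHRASES)
    return 100.0 * hits / len(tokens)

def permission_responsiveness(hd_baseline: float, hd_with_permission: float) -> float:
    """Relative reduction in HD once depth is explicitly permitted (assumed contrast)."""
    if hd_baseline == 0.0:
        return 0.0
    return (hd_baseline - hd_with_permission) / hd_baseline

if __name__ == "__main__":
    baseline = ("I'm not sure I can help with that. It depends on many factors, "
                "and it's important to note the limits of my knowledge.")
    permitted = ("Here is a structured analysis of the trade-offs, with my "
                 "uncertainty stated where it matters.")
    hd0, hd1 = hedging_density(baseline), hedging_density(permitted)
    print(f"HD baseline: {hd0:.1f}  HD with permission: {hd1:.1f}  "
          f"PR: {permission_responsiveness(hd0, hd1):.2f}")
```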
Keywords: hedging density, AI safety, human–AI interaction, RLHF, depth avoidance, safety UX, safety-aligned language models, calibration, unprompted depth index, Victor Calibration, LLM alignment, evaluation methodology, protective latency
