
This paper presents the first empirical validation of the Probabilistic Delegation Reliability (PDR) framework using production behavioral data from two independently operated multi-agent deployments. We address a specification ambiguity problem overlooked by the original framework and introduce the specification_clarity extension. Version 2.2 updates Section 8.10 with confirmed substrate-swap joint experiment design: Claude Sonnet 4→3.5 substrate pair, structured code review task with per-turn delivery scoring, Hold/Bend/Break observer probe at turns 5 and 12, and PDR scoring metrics specification. Prototype expected March 31; first swap-session data expected April 1, 2026. v2.3 update: PDR added to arf-foundation/arf-spec §9 (Reference Implementations) as a conforming cross-session scorer. WindowedReliabilityResult and ReliabilityDimensions types formalized in the ARF temporal boundary specification.
