Powered by OpenAIRE graph
Preprint
Data sources: ZENODO

DARP: Diff-Anchored Reporting Proof. Publishing tamper-evident behavioral results from closed, proprietary environments without releasing code or logs

Authors: PEREIRA, Luciano Federico;

Abstract

Measuring agent behavior in production environments differs fundamentally from benchmark evaluation, creating conditions that directly undermine whether empirical claims can be trusted. While open-source research relies on the publication of full repositories and raw data for independent re-execution, production environments are frequently constrained by intellectual property boundaries, privacy concerns, and security policies that preclude sharing code or logs. The conventional basis for empirical trust is therefore absent; a behavioral statistic from such a deployment can be neither reproduced nor audited by an external reviewer. This opacity is compounded by reflexivity: the measurement instrument is itself software under active development, often modified by the very class of agent it measures.

We isolate three specific failure modes inherent to this environment (auditor = subject, instrument-equivalence, and unanchored followthrough) and introduce DARP (Diff-Anchored Reporting Proof), a protocol designed to publish verifiable behavioral results from proprietary or closed environments without exposing the underlying codebase. DARP generates a self-contained, de-identified artifact by hashing session identifiers, scrubbing directory paths, and reducing tool interactions to abstract, typed schemas. Using this embedded event stream, an external reviewer can independently re-derive the reported metrics and validate them against a cryptographic content hash, an OpenTimestamps proof anchored to Bitcoin, and the author’s ORCID record. This validation protocol relies entirely on public read access and requires neither the original repository nor privileged credentials.
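
The de-identification step described above (hashing session identifiers, scrubbing directory paths, reducing tool interactions to typed schemas) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the protocol's actual implementation: the field names (`session_id`, `tool`, `args`, `ok`, `summary`), the salt, the path pattern, and the 16-character digest truncation are all hypothetical.

```python
import hashlib
import re

# Hypothetical salt; a real deployment would keep it private so that
# published digests cannot be matched against guessed session identifiers.
SALT = b"example-salt"

# Assumed pattern: absolute POSIX-style paths only (illustrative, not exhaustive).
PATH_RE = re.compile(r"(?:/[\w.\-]+)+")


def hash_session_id(session_id: str) -> str:
    """Replace a session identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256(SALT + session_id.encode("utf-8")).hexdigest()[:16]


def scrub_paths(text: str) -> str:
    """Replace anything that looks like an absolute path with a placeholder."""
    return PATH_RE.sub("<PATH>", text)


def to_typed_event(raw: dict) -> dict:
    """Reduce a raw tool call to an abstract record.

    Argument values are dropped; only argument names and value types survive.
    """
    return {
        "session": hash_session_id(raw["session_id"]),
        "tool": raw["tool"],
        "arg_types": {k: type(v).__name__ for k, v in sorted(raw["args"].items())},
        "ok": bool(raw["ok"]),
        "summary": scrub_paths(raw.get("summary", "")),
    }
```

In this sketch the same raw event always maps to the same published record, which is what lets a reviewer recompute statistics from the artifact alone.
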
Although DARP serves as a generalized validation mechanism applicable to public repositories, its primary contribution is architectural discipline for closed environments: any behavioral claim a system asserts about itself should be constructed so that an external party can deterministically re-derive it without access to the host system.

Empirical validation in software-driven agent deployments is constrained by a persistent operational boundary: the codebases and execution environments are frequently proprietary. Execution traces name internal network endpoints, specific directory paths, and sensitive system identifiers. Publishing these raw repositories or unedited logs would constitute a severe security breach rather than a mere administrative hurdle. This constraint is endemic to production-grade agent measurement, and it effectively nullifies standard open-science remedies. Both pre-registration and open-data policies presume that the measurement instrument and its complete input corpus can be exposed to external scrutiny. When privacy or intellectual property boundaries prevent this exposure, conventional pathways to replication collapse.

Pre-registration, the ex-ante commitment to outcome variables, operational thresholds, and analytical pipelines, was explicitly developed to restrict an evaluator’s degrees of freedom (Simmons et al., 2011; Leamer, 1983). It remains the structural benchmark in clinical trials (De Angelis et al., 2004) and is increasingly advocated within empirical machine learning (Pineau et al., 2021). Yet, while pre-registration is a necessary step, it falls short when applied to live agentic systems due to two distinct dynamics. First, it assumes a static, frozen instrument, which a codebase under active, often AI-assisted development lacks. Second, it assumes independent data analysis, which a secure production environment precludes.
DARP bridges this gap: a behavioral claim that cannot be validated by releasing the underlying system must instead be encapsulated in a detached artifact that an external party can evaluate independently.

This framework is governed by the principle of reflexivity. An agent cannot reliably validate its own assertions, as the verifying mechanism is vulnerable to the same systemic failure modes as the primary action. Verification must therefore shift to a non-agent mechanism at a serialization boundary. This logic scales directly to measurement. A self-reported behavioral statistic is merely an unverified assertion made by the subject system. Because an agentic system cannot serve as its own auditor, the resulting metrics must be rendered verifiable by an independent non-participant rather than accepted on the producing agent’s authority.
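
As a rough illustration of what an independent, non-agent check at a serialization boundary could involve, the Python sketch below recomputes a content hash over a canonical serialization of the event stream and re-derives one metric from the events themselves. The artifact layout (`events`, `content_hash`, `reported`), the `success_rate` metric, and the canonicalization choice are assumptions made for this example; the OpenTimestamps and ORCID checks are outside its scope, though the recomputed digest is the value a reviewer would compare against a timestamped one.

```python
import hashlib
import json


def canonical_bytes(events: list) -> bytes:
    """Serialize deterministically: sorted keys, no insignificant whitespace."""
    return json.dumps(events, sort_keys=True, separators=(",", ":")).encode("utf-8")


def content_hash(events: list) -> str:
    """SHA-256 over the canonical serialization of the event stream."""
    return hashlib.sha256(canonical_bytes(events)).hexdigest()


def success_rate(events: list) -> float:
    """Hypothetical metric: fraction of events flagged as successful."""
    return sum(1 for e in events if e["ok"]) / len(events)


def verify(artifact: dict, tol: float = 1e-9) -> bool:
    """Accept the artifact only if the hash and the re-derived metric both match."""
    events = artifact["events"]
    if not events:
        return False
    if content_hash(events) != artifact["content_hash"]:
        return False  # event stream differs from what was committed to
    reported = artifact["reported"]["success_rate"]
    return abs(success_rate(events) - reported) <= tol
```

The check reads only the published artifact, so it can be run by a party with no access to the producing system and no trust in its self-report.
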
