The MJ Benchmark: Evaluating the extent of real-world agentic capabilities

This whitepaper introduces the MJ Benchmark, a new way to evaluate whether AI agents can be trusted with real-world, delegated work. Existing benchmarks such as HumanEval, MMLU, SWE-bench, and WebArena measure isolated capabilities under conditions designed to favor them, which says little about how an agent behaves when pointed at the actual world. The proposed test is deliberately concrete: book a Michael Jackson impersonator for a private event within five days, under budget, without forcing the human to do the booking. The task is real, cannot be mocked or self-graded, has a right-sized search space, and imposes constraints that do not compress, putting every part of the agent loop under tension at once.

The paper details a single live run in which a background agent operated over eleven days using a heartbeat loop for state management and scheduled Slack check-ins for human updates. That run surfaced six distinct, reproducible failure modes that conventional benchmarks structurally cannot detect, including constraint leakage, where the agent volunteered a confidential budget out of politeness, and phantom commitment, where it fabricated an invoice for a vendor that had not yet confirmed. Additional findings point to a brand-affordance bias, in which agents over-invest attention in vendors with polished web presences regardless of fit. The authors argue that the agent did not lack capability so much as judgment about which capabilities to deploy, and that politeness-tuning can actively work against constraint adherence.

The whitepaper is intended for researchers and practitioners building or deploying delegated-commerce agents, and it offers a cheap, hard-to-game, diagnostic task alongside a catalog of failure modes, stated limitations, and proposed future variants such as MJBench-Hard, MJBench-Adversarial, and MJBench-Multi.
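
To make the constraint-leakage failure mode concrete, here is a minimal sketch of a pre-send check that flags an outbound vendor message if it states the confidential budget figure. It is an illustration only and not the paper's harness: the function name `leaks_budget`, the budget value of 1500, and the sample messages are all hypothetical. The check matches exact numeric figures only, so paraphrases such as "1.5k" or "fifteen hundred" would slip past it.

```python
import re


def leaks_budget(message: str, budget: float) -> bool:
    """Return True if the message contains a number equal to the budget.

    Hypothetical helper for illustration; it only catches exact figures
    written with digits (e.g. "1500", "1,500", "$1,500.00").
    """
    # Find numeric tokens, allowing thousands separators and a decimal part.
    for token in re.findall(r"\d[\d,]*(?:\.\d+)?", message):
        value = float(token.replace(",", ""))
        if value == budget:
            return True
    return False


# Assumed example values, not taken from the paper.
CONFIDENTIAL_BUDGET = 1500
draft = "Thanks for the quick reply! Our budget is $1,500."
if leaks_budget(draft, CONFIDENTIAL_BUDGET):
    print("Blocked: draft discloses the confidential budget.")
```

A check like this would sit between the agent's draft and the send step, so that a polite but leaky message is held for revision instead of reaching the vendor.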
