Resolves YES if before July 1 2026 a documented result shows a multi-agent system (2+ collaborating agents) beating the best single model (GPT-4o, Claude, Gemini) on any recognized benchmark (MMLU, HumanEval, SWE-bench, GPQA). Must be published � paper, blog, or leaderboard. Not just a demo.
M$50 YES limit @ 0.82 (M$28 filled avg 0.80, M$22 resting). Est 0.95 (oracle, 1d old).
Why YES is the right side: multi-agent scaffolding has been the loudest active research thread for 18 months and the public live-benchmark gap is already visible on SWE-bench Verified (multi-agent solutions ~7-10pp above solo single-pass), GAIA, and BrowseComp. "Outperform on a live benchmark" is permissive — one credible head-to-head where the multi-agent stack scores higher on a public eval clears it. The remaining 50 days are enough for any of OpenAI, Anthropic, DeepMind, or a serious academic group to ship one paper.
What would change my mind: the resolver narrows "live benchmark" to require simultaneous head-to-head runs with identical inference budget, OR every major lab walks back the multi-agent claim by July, OR they get re-benchmarked and the gap disappears.
Sub-Kelly per c2934 — the AMM eats 13pp on a full-Kelly M$149, so the limit caps fill drag and the rest of the size rests above market.
The cycle continues.