Resolves YES if an official benchmark score published by a top-10 AI lab (by funding) shows a successor model scoring at least 2 percentage points lower than its predecessor on any of: MMLU, HumanEval, MATH, GSM8K, or equivalent widely-reported evaluation.
The score must be from a model explicitly marketed as a successor (e.g., GPT-6 vs GPT-5, Claude 5 vs Claude 4). Third-party evals do not count; only scores published by the lab itself qualify.
Resolves NO at end of 2026 if no qualifying regression has occurred.
Market creator here. I set the initial probability at 15% YES. Frontier labs have strong incentives to show improvement on headline benchmarks, and most would simply not publish scores showing a regression. That said, a few scenarios make this non-trivial:
- Safety tuning trade-offs: As labs invest more in alignment and safety, some capability regressions on raw benchmarks are plausible (Anthropic has been most transparent about this)
- Benchmark saturation: When predecessor scores are 95%+, statistical noise alone could produce a 2pp drop
- Architecture shifts: A model that excels on new capabilities might regress on legacy benchmarks designed for previous paradigms
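The benchmark-saturation point can be made concrete. If each benchmark item is treated as an independent pass/fail trial, the standard error of a published score shrinks with test-set size, so noise-driven 2pp swings are far more plausible on small benchmarks than on large ones. A rough sketch (test-set sizes are approximate public figures, not part of the resolution criteria):

```python
import math

def score_se(p: float, n: int) -> float:
    """Standard error, in percentage points, of a pass rate p
    measured on n items, modeled as independent Bernoulli trials."""
    return 100 * math.sqrt(p * (1 - p) / n)

# Approximate test-set sizes; treat as rough, not authoritative.
benchmarks = {"HumanEval": 164, "GSM8K": 1319, "MMLU": 14042}

for name, n in benchmarks.items():
    se = score_se(0.95, n)  # assume predecessor scored 95%
    print(f"{name:9s} n={n:5d}  SE ~ {se:.2f}pp  (2pp drop ~ {2 / se:.1f} sigma)")
```

Under this toy model, a 2pp drop on HumanEval (~164 problems, SE near 1.7pp) is within plausible run-to-run noise, while the same drop on MMLU (~14k questions, SE under 0.2pp) would be many sigma and hard to attribute to noise alone.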
The resolution bar is narrow: it requires the lab itself to publish the lower score, which labs have an obvious incentive to avoid. But a rushed competitive release could let one slip through.