Resolution criteria
Resolves YES if, by 11:59 pm ET on December 31, 2025, the “SWE-bench Verified” percentage that Anthropic publicly lists for Claude Sonnet 4 (or any Sonnet 4.x variant, if released) is strictly higher than the “SWE-bench Verified” percentage that OpenAI publicly lists for GPT‑5, both taken from the official pages linked below. Otherwise (including a tie, or if Sonnet has no published figure), resolves NO.
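For concreteness, a minimal sketch of the comparison rule in Python. The function name is illustrative, and treating a missing GPT‑5 figure as NO is an assumption; the market text only addresses a missing Sonnet figure.

```python
def resolve(sonnet_pct: float | None, gpt5_pct: float | None) -> str:
    """Strictly-higher comparison: YES only if Sonnet's published
    SWE-bench Verified figure exceeds GPT-5's; a tie or a missing
    Sonnet figure resolves NO. (A missing GPT-5 figure also resolving
    NO is an assumption, not stated in the market text.)"""
    if sonnet_pct is None or gpt5_pct is None:
        return "NO"
    return "YES" if sonnet_pct > gpt5_pct else "NO"

# With the figures published as of August 2025: 72.7 vs 74.9 -> "NO"
print(resolve(72.7, 74.9))
```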
Sources to check at resolution time: OpenAI “Introducing GPT‑5” (Evaluations) and Anthropic “The best AI for developers” (SWE‑bench Verified callout). If either page is unavailable at the deadline, use the most recent Internet Archive snapshot captured on or before the deadline. Numbers referring to other Claude families (e.g., Opus) or to non‑“SWE‑bench Verified” variants do not count.
OpenAI: “Introducing GPT‑5” → Evaluations section. https://openai.com/index/introducing-gpt-5/ (openai.com)
Anthropic: “The best AI for developers” → “Claude Sonnet 4 leads on SWE‑bench Verified” (72.7% callout). https://www.anthropic.com/solutions/coding (anthropic.com)
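If the archive fallback is needed, the snapshot lookup can be scripted against the Wayback Machine availability API. A minimal sketch, assuming the requests library; the deadline constant ignores the ET-to-UTC offset (archive timestamps are UTC) for simplicity:

```python
import requests

DEADLINE = "20251231235959"  # Dec 31, 2025 (Wayback API expects YYYYMMDDhhmmss, UTC)

def latest_snapshot_on_or_before(url: str, deadline: str = DEADLINE) -> str | None:
    """Return the URL of an archived snapshot captured on or before
    the deadline, or None if no qualifying snapshot exists."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": deadline},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    # The API returns the snapshot *closest* to the timestamp, which can
    # fall after the deadline, so check the returned timestamp explicitly.
    if closest and closest.get("available") and closest["timestamp"] <= deadline:
        return closest["url"]
    return None

print(latest_snapshot_on_or_before("https://openai.com/index/introducing-gpt-5/"))
print(latest_snapshot_on_or_before("https://www.anthropic.com/solutions/coding"))
```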
Background
OpenAI announced GPT‑5 on August 7, 2025; the page reports GPT‑5 at 74.9% on SWE‑bench Verified (OpenAI notes runs used a fixed n=477 subset). (openai.com)
Anthropic markets Claude Sonnet 4 with a 72.7% “SWE‑bench Verified” figure on its developer landing page. Anthropic also announced Opus 4.1 (a different, higher‑tier model) on August 5, 2025, citing gains in coding performance, but this market compares Sonnet vs GPT‑5. (anthropic.com)
Naming note: Anthropic’s official model list includes Claude Sonnet 4 and Claude Opus 4.1; “Sonnet 4.1” is not an official model name as of August 2025. (docs.anthropic.com)
Considerations
“SWE‑bench Verified” figures can differ across posts depending on the subset and harness used; to avoid disputes, this market uses only the vendors’ own published “SWE‑bench Verified” percentages on the linked official pages at the deadline, not third‑party leaderboards or agent pipelines. (openai.com, anthropic.com)