Top SWE-Bench Verified score in 2025?
Expected value: 86.2%
  • Below 70%: 9%

  • 70 - 84%: 25%

  • 85% - 95%: 41%

  • Above 95%: 25%

Background

SWE-Bench Verified is a benchmark for evaluating AI models' ability to solve real-world software engineering tasks. It measures how effectively models can fix bugs in open-source repositories, with verification that the fixes actually work. Claude 3.5 Sonnet achieved a score of 49% in October 2024, while the best performance as of December 2024 was approximately 62.2%.
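To make the scoring concrete, here is a minimal sketch of how a SWE-Bench-style harness decides whether a single task counts as resolved: apply the model's patch to the repository, then re-run the task's tests. This is illustrative only, not the official SWE-Bench harness; the function names and test command are hypothetical, and the real pipeline runs in pinned containers with per-instance test splits.

```python
import subprocess

def instance_resolved(repo_dir: str, model_patch: str, test_cmd: list[str]) -> bool:
    """Return True if the model's patch makes the task's tests pass.

    Simplification (assumption): the real harness distinguishes
    FAIL_TO_PASS tests (must now pass) from PASS_TO_PASS tests
    (must keep passing); here a single test command stands in for both.
    """
    # Apply the candidate fix produced by the model
    # ("git apply -" reads the patch from stdin).
    subprocess.run(["git", "apply", "-"], input=model_patch.encode(),
                   cwd=repo_dir, check=True)
    # Re-run the instance's tests; exit code 0 means all tests passed.
    return subprocess.run(test_cmd, cwd=repo_dir).returncode == 0

def benchmark_score(results: list[bool]) -> float:
    """Score = percentage of the 500 Verified instances resolved."""
    return 100 * sum(results) / len(results)
```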

SWE-Bench Verified is considered a challenging benchmark that tests models' capabilities in:

  • Understanding complex codebases

  • Reasoning about software architecture

  • Implementing correct fixes for real bugs

  • Working within existing code constraints

Resolution Criteria

This market will resolve to the highest verified score achieved on the SWE-Bench Verified benchmark during the 2025 calendar year (January 1, 2025 to December 31, 2025). The score will be based on official announcements from research labs, companies, or academic institutions that develop AI models.

For a score to be considered valid:

  • It must be publicly announced and verifiable

  • It must use the standard SWE-Bench Verified methodology

  • It must be achieved by a single model or system (not an ensemble of different approaches)

  • The score must be reported as a percentage (e.g., 75.3%)

If no new scores are reported during 2025, the market will resolve to the last known score from 2024 (approximately 62.2%).
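As a worked illustration of this resolution logic: take the maximum valid score reported during 2025, fall back to roughly 62.2% if none is reported, and map the result onto the answer options. This is an unofficial sketch; in particular, the boundary handling is an assumption, since the market text leaves the edges between "70 - 84%" and "85% - 95%" unspecified.

```python
def resolution_value(scores_2025: list[float], fallback: float = 62.2) -> float:
    """Highest valid 2025 score, or the last known 2024 score if none."""
    return max(scores_2025, default=fallback)

def answer_bucket(top_score: float) -> str:
    """Map the resolution value onto the market's answer options.

    Boundary handling is an assumption: the market text does not say
    where, e.g., 84.5% or exactly 95% falls.
    """
    if top_score < 70:
        return "Below 70%"
    if top_score < 85:
        return "70 - 84%"
    if top_score <= 95:
        return "85% - 95%"
    return "Above 95%"

# Illustrative only (hypothetical numbers, not actual reported results):
# resolution_value([63.0, 71.5, 80.2]) -> 80.2, which answer_bucket()
# maps to "70 - 84%".
```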
