Background
SWE-Bench Verified is a benchmark for evaluating AI models' ability to solve real-world software engineering tasks. It consists of 500 human-validated issues from open-source repositories, and a candidate fix counts as resolved only if the repository's tests pass after the model's patch is applied. Claude 3.5 Sonnet scored 49% in October 2024, and the best publicly reported performance as of December 2024 was approximately 62.2%.
SWE-Bench Verified is considered a challenging benchmark that tests models' capabilities in:
Understanding complex codebases
Reasoning about software architecture
Implementing correct fixes for real bugs
Working within existing code constraints
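For reference, a SWE-Bench Verified score is simply the share of the benchmark's 500 tasks that a system resolves, where a task counts as resolved only if the repository's tests pass after the model's patch is applied. Below is a minimal sketch of that computation, assuming a hypothetical JSON results report that lists resolved instance IDs (the field name resolved_ids is an assumption, not the official harness format):

```python
import json

TOTAL_INSTANCES = 500  # SWE-Bench Verified is a fixed set of 500 human-validated tasks


def verified_score(report_path: str) -> float:
    """Return the percentage of Verified tasks resolved in a results report."""
    with open(report_path) as f:
        report = json.load(f)
    # "resolved_ids" is a hypothetical field: the instance IDs whose tests
    # pass after the model's patch is applied.
    resolved = len(report["resolved_ids"])
    return round(100 * resolved / TOTAL_INSTANCES, 1)


# Example: a 49% score corresponds to roughly 245 of the 500 tasks resolved.
```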
Resolution Criteria
This market will resolve to the highest verified score achieved on the SWE-Bench Verified benchmark during the 2025 calendar year (January 1, 2025 to December 31, 2025). The score will be based on official announcements from research labs, companies, or academic institutions that develop AI models.
For a score to be considered valid:
It must be publicly announced and verifiable
It must use the standard SWE-Bench Verified methodology
It must be achieved by a single model or system (not an ensemble of different approaches)
The score must be reported as a percentage (e.g., 75.3%)
If no new scores are reported during 2025, the market will resolve to the last known score from 2024 (approximately 62.2%).
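Stated as a rule, the resolution logic amounts to taking the maximum valid 2025 score, or falling back to the 2024 figure if no valid score is reported. A minimal sketch under that reading (the 62.2% fallback is the figure quoted above, and the validity checks mirror the criteria listed in this section):

```python
from datetime import date


def is_valid(score: float, announced: date, single_system: bool,
             standard_methodology: bool) -> bool:
    """Check one reported result against the market's validity criteria."""
    in_2025 = date(2025, 1, 1) <= announced <= date(2025, 12, 31)
    return in_2025 and single_system and standard_methodology and 0 <= score <= 100


def resolution_value(valid_scores_2025: list[float],
                     fallback_2024: float = 62.2) -> float:
    """Resolve to the highest valid 2025 score, or the last known 2024 score."""
    return max(valid_scores_2025, default=fallback_2024)
```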