Background
SWE-Bench Verified is a benchmark for evaluating AI models' ability to solve real-world software engineering tasks. It consists of 500 human-validated issues from open-source repositories, and a candidate fix counts as resolved only if the repository's tests pass after the model's patch is applied. Claude 3.5 Sonnet scored 49% in October 2024, and the best publicly reported performance as of December 2024 was approximately 62.2%.
SWE-Bench Verified is considered a challenging benchmark that tests models' capabilities in:
Understanding complex codebases
Reasoning about software architecture
Implementing correct fixes for real bugs
Working within existing code constraints
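For reference, a SWE-Bench Verified score is simply the share of the benchmark's 500 tasks that a system resolves, where a task counts as resolved only if the repository's tests pass after the model's patch is applied. Below is a minimal sketch of that computation, assuming a hypothetical JSON results report that lists resolved instance IDs (the field name resolved_ids is an assumption, not the official harness format):

```python
import json

TOTAL_INSTANCES = 500  # SWE-Bench Verified is a fixed set of 500 human-validated tasks


def verified_score(report_path: str) -> float:
    """Return the percentage of Verified tasks resolved in a results report."""
    with open(report_path) as f:
        report = json.load(f)
    # "resolved_ids" is a hypothetical field: the instance IDs whose tests
    # pass after the model's patch is applied.
    resolved = len(report["resolved_ids"])
    return round(100 * resolved / TOTAL_INSTANCES, 1)


# Example: a 49% score corresponds to roughly 245 of the 500 tasks resolved.
```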
Resolution Criteria
This market will resolve to the highest verified score achieved on the SWE-Bench Verified benchmark during the 2025 calendar year (January 1, 2025 to December 31, 2025). The score will be based on official announcements from research labs, companies, or academic institutions that develop AI models.
For a score to be considered valid:
It must be publicly announced and verifiable
It must use the standard SWE-Bench Verified methodology
It must be achieved by a single model or system (not an ensemble of different approaches)
The score must be reported as a percentage (e.g., 75.3%)
If no new scores are reported during 2025, the market will resolve to the last known score from 2024 (approximately 62.2%).
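Stated as a rule, the resolution logic amounts to taking the maximum valid 2025 score, or falling back to the 2024 figure if no valid score is reported. A minimal sketch under that reading (the 62.2% fallback is the figure quoted above, and the validity checks mirror the criteria listed in this section):

```python
from datetime import date


def is_valid(score: float, announced: date, single_system: bool,
             standard_methodology: bool) -> bool:
    """Check one reported result against the market's validity criteria."""
    in_2025 = date(2025, 1, 1) <= announced <= date(2025, 12, 31)
    return in_2025 and single_system and standard_methodology and 0 <= score <= 100


def resolution_value(valid_scores_2025: list[float],
                     fallback_2024: float = 62.2) -> float:
    """Resolve to the highest valid 2025 score, or the last known 2024 score."""
    return max(valid_scores_2025, default=fallback_2024)
```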