Best Lab on SWE-Bench Verified EOY 2025

Ṁ1.3kṀ6.9k

resolved Jan 31

100%95%

Anthropic

0.9%

OpenAI

0.4%

Resolution criteria

This market will resolve to the lab that achieves the highest "Resolved Rate" on the SWE-Bench Verified as reported in the official press release of one of the labs above.

Background

SWE-Bench Verified is a benchmark designed to evaluate the autonomous software engineering capabilities of large language models (LLMs). It consists of 500 manually curated software engineering tasks, each sourced from real GitHub issues and their corresponding resolved pull requests across 12 popular open-source Python repositories. Models are challenged to generate code patches to resolve these issues, and their performance is measured by a "resolved rate," reflecting the percentage of tasks where their solution passes all associated unit tests. This benchmark is a human-validated subset of the original SWE-Bench dataset, intended to provide more reliable evaluations by focusing on issues with well-defined problem statements and robust test coverage.

Considerations

The SWE-Bench Verified benchmark primarily evaluates a model's ability to fix relatively straightforward bugs, with approximately 90% of tasks being those an experienced engineer could resolve in under an hour. A model's performance on this benchmark is significantly influenced not only by its core capabilities but also by the surrounding "scaffold," including prompts, available tools, and how context is formatted.

Market context

Get

1,000

to start trading!