AI resolves at least X% on SWE-bench without any assistance, by 2028?
20
157
2.5k
2027
X = 4: 92%
X = 8: 93%
X = 16: 87%
X = 32: 74%
X = 40: 68%
X = 50: 63%
X = 60: 39%
X = 70: 35%
X = 75: 29%
X = 80: 28%
X = 85: 17%
X = 90: 16%
X = 95: 15%

Currently, the SOTA resolves 1.96% of issues "unassisted".

For the resolve rate where assistance is provided, please refer to the following market:

Leaderboard (Scroll a bit)


It appears that while Devin gets really good scores on SWE-bench (14%), it's misleading. They don't test on SWE-bench; they test on a small subset of SWE-bench which contains only pull requests.

@firstuserhere SWE-Bench is only pull requests:

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

See swebench.com
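
The evaluation described above (unit-test verification using post-PR behavior as the reference) can be sketched roughly as follows. This is a minimal illustration, not the actual SWE-bench harness: the names `Task`, `is_resolved`, and `resolve_rate` are hypothetical, though the FAIL_TO_PASS / PASS_TO_PASS distinction mirrors how the benchmark classifies tests.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One SWE-bench-style instance: a GitHub issue plus the tests that
    define a correct fix (hypothetical structure for illustration)."""
    repo: str
    issue_id: str
    fail_to_pass: list   # tests that must flip from failing to passing
    pass_to_pass: list   # tests that must not regress

def is_resolved(task: Task, test_results: dict) -> bool:
    """A task counts as resolved only if every fail-to-pass test now
    passes and no pass-to-pass test has regressed after applying the
    model's patch."""
    return (all(test_results.get(t) == "PASS" for t in task.fail_to_pass)
            and all(test_results.get(t) == "PASS" for t in task.pass_to_pass))

def resolve_rate(tasks: list, results_by_issue: dict) -> float:
    """Percentage of tasks resolved end-to-end, as reported on the leaderboard."""
    resolved = sum(is_resolved(t, results_by_issue.get(t.issue_id, {}))
                   for t in tasks)
    return 100.0 * resolved / len(tasks)
```

So a "resolved" count is all-or-nothing per issue: one failing reference test means the patch gets no credit, which is part of why headline resolve rates are low.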

X = 4

I'll resolve YES to X = 4 and X = 8 after a few days' wait, just to make sure it's all legit.

From https://www.cognition-labs.com/blog

We evaluated Devin on SWE-bench, a challenging benchmark that asks agents to resolve real-world GitHub issues found in open source projects like Django and scikit-learn.

Devin correctly resolves 13.86%* of the issues end-to-end, far exceeding the previous state-of-the-art of 1.96%. Even when given the exact files to edit, the best previous models can only resolve 4.80% of issues.

We plan to publish a more detailed technical report soon—stay tuned for more details.
