AI resolves at least X% on SWE-bench without any assistance, by 2028?
20
157
2.5k
2027
X = 4: 92%
X = 8: 93%
X = 16: 87%
X = 32: 74%
X = 40: 68%
X = 50: 63%
X = 60: 39%
X = 70: 35%
X = 75: 29%
X = 80: 28%
X = 85: 17%
X = 90: 16%
X = 95: 15%

Currently, the SOTA resolves 1.96% of issues "unassisted".

For the resolve rate where assistance is provided, please refer to the following market:

Leaderboard (Scroll a bit)


It appears that while Devin gets really good scores on SWE-bench (14%), it's misleading. They don't test on SWE-bench; they test on a small subset of SWE-bench which contains only pull requests.

@firstuserhere SWE-Bench is only pull requests:

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

See swebench.com
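
The evaluation described above (unit-test verification using post-PR behavior as the reference) can be sketched roughly as follows. This is a minimal illustration, not the actual SWE-bench harness: the names `Task`, `is_resolved`, and `resolve_rate` are hypothetical, though the FAIL_TO_PASS / PASS_TO_PASS distinction mirrors how the benchmark classifies tests.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One SWE-bench-style instance: a GitHub issue plus the tests that
    define a correct fix (hypothetical structure for illustration)."""
    repo: str
    issue_id: str
    fail_to_pass: list   # tests that must flip from failing to passing
    pass_to_pass: list   # tests that must not regress

def is_resolved(task: Task, test_results: dict) -> bool:
    """A task counts as resolved only if every fail-to-pass test now
    passes and no pass-to-pass test has regressed after applying the
    model's patch."""
    return (all(test_results.get(t) == "PASS" for t in task.fail_to_pass)
            and all(test_results.get(t) == "PASS" for t in task.pass_to_pass))

def resolve_rate(tasks: list, results_by_issue: dict) -> float:
    """Percentage of tasks resolved end-to-end, as reported on the leaderboard."""
    resolved = sum(is_resolved(t, results_by_issue.get(t.issue_id, {}))
                   for t in tasks)
    return 100.0 * resolved / len(tasks)
```

So a "resolved" count is all-or-nothing per issue: one failing reference test means the patch gets no credit, which is part of why headline resolve rates are low.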

X = 4

I'll resolve YES to X = 4 and X = 8 after a few days' wait, just to make sure it's all legit.

From https://www.cognition-labs.com/blog

We evaluated Devin on SWE-bench, a challenging benchmark that asks agents to resolve real-world GitHub issues found in open source projects like Django and scikit-learn.

Devin correctly resolves 13.86%* of the issues end-to-end, far exceeding the previous state-of-the-art of 1.96%. Even when given the exact files to edit, the best previous models can only resolve 4.80% of issues.

We plan to publish a more detailed technical report soon—stay tuned for more details.
