AI resolves at least X% on SWE-bench WITH assistance, by 2028?
X = 5
X = 10
X = 15
X = 20
X = 30
X = 40
X = 50
X = 60
X = 65
X = 70
X = 75
X = 80
X = 85

Currently the SOTA resolves 4.80% of issues "with assistance":

For the unassisted leaderboard, please refer to the following market:

Leaderboard live:


Is there any measure of human performance on SWE-bench?


It appears that while Devin gets really good scores on SWE-bench (14%), it's misleading. They don't test on SWE-bench; they test on a small subset of SWE-bench which contains only pull requests.

@firstuserhere seeing a new pfp is so disorienting 😅 and it's nice that you're back

anyone with access to Devin will be able to test on SWE Bench, right?

@shankypanky ikr, even i feel disoriented, switching back 😂

@firstuserhere haha it's just such a wild and unexpected character arc 😂 😂 😂

@firstuserhere Do you have any info beyond what was posted on their blog?

"Devin was evaluated on a random 25% subset of the dataset. Devin was unassisted, whereas all other models were assisted (meaning the model was told exactly which files need to be edited)."


This sounds exactly like how they tested GPT-4.

"GPT-4 is evaluated on a random 25% subset of the dataset."


So to me that's valid and fair. The wording on the blog implies Cognition ran the benchmark themselves. I could understand waiting for independent verification although it might be too cost-prohibitive for others to run so we might wait forever in that case.

@SIMOROBO actually you might be right, i will read more about it, made the comment without checking in-depth

@firstuserhere Yeah, I'd love a source for the "only pull requests" claim. My impression was that it's a random 25% subset.

@Nikola The SWE-Bench dataset is pull requests. Any random subset is only pull requests.

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
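The scoring rule described above (an issue counts as resolved only if the repository's post-PR unit tests all pass) can be sketched roughly as follows. This is a minimal illustration, not the real SWE-bench harness; the task IDs and test names are hypothetical:

```python
# Sketch of SWE-bench-style scoring: a task counts as "resolved" only if
# every reference unit test passes after the model's patch is applied.
# Illustration only -- not the actual evaluation harness.

def is_resolved(test_results):
    """A task is resolved only if all of its post-PR unit tests pass."""
    return all(test_results.values())

def resolve_rate(tasks):
    """Percentage of tasks whose full reference test suite passes."""
    resolved = sum(is_resolved(results) for results in tasks.values())
    return 100.0 * resolved / len(tasks)

# Hypothetical per-test outcomes for three issue-PR pairs:
tasks = {
    "django__django-11001": {"test_a": True, "test_b": True},
    "scikit-learn__sklearn-7": {"test_a": True, "test_b": False},
    "sympy__sympy-42": {"test_a": False},
}
print(f"{resolve_rate(tasks):.2f}%")  # only the first task is resolved
```

Note the all-or-nothing rule: a patch that fixes most of the behavior but fails one reference test still scores zero for that task, which is part of why resolve rates on this benchmark are so low.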


bought Ṁ15 X = 10 YES


We evaluated Devin on SWE-bench, a challenging benchmark that asks agents to resolve real-world GitHub issues found in open source projects like Django and scikit-learn.

Devin correctly resolves 13.86%* of the issues end-to-end, far exceeding the previous state-of-the-art of 1.96%. Even when given the exact files to edit, the best previous models can only resolve 4.80% of issues.

We plan to publish a more detailed technical report soon—stay tuned for more details.