What will be the best score on the SWE-Bench (unassisted) benchmark before 2025?

This question resolves to the state-of-the-art accuracy achieved on the SWE-bench (unassisted) benchmark by an AI system, including any post-training enhancements but excluding any human assistance. Resolution will be based on credible, publicly available results reported prior to January 1st, 2025. The primary credible source will be the official leaderboard, but other sources, including but not limited to arXiv preprints and published papers, may also be considered.

Background information:

See SWE-bench.

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification, using post-PR behavior as the reference solution. Read more in the SWE-bench paper.
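The evaluation step above can be sketched in miniature: write a candidate fix and the reference unit tests into a sandbox, run the tests, and count the instance as resolved only if they pass. This is a simplified illustration, not the official SWE-bench harness; the `slugify` bug and its tests are hypothetical examples.

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap


def evaluate_patch(source_code: str, test_code: str) -> bool:
    """Write candidate code and tests to a temp dir, run the tests,
    and report whether they pass. SWE-bench similarly counts an
    instance as resolved only if the fail-to-pass tests succeed."""
    with tempfile.TemporaryDirectory() as tmp:
        d = pathlib.Path(tmp)
        (d / "mymodule.py").write_text(source_code)
        (d / "test_mymodule.py").write_text(test_code)
        proc = subprocess.run(
            [sys.executable, "-m", "unittest", "test_mymodule"],
            cwd=d,
            capture_output=True,
        )
        return proc.returncode == 0


# Hypothetical issue: slugify should lowercase and hyphenate spaces.
buggy = "def slugify(s):\n    return s\n"
patched = "def slugify(s):\n    return s.lower().replace(' ', '-')\n"

# Reference tests derived from the post-PR behavior.
tests = textwrap.dedent("""
    import unittest
    from mymodule import slugify

    class TestSlugify(unittest.TestCase):
        def test_slug(self):
            self.assertEqual(slugify('Hello World'), 'hello-world')
""")

print(evaluate_patch(buggy, tests))    # buggy code fails the tests -> False
print(evaluate_patch(patched, tests))  # patched code passes -> True
```

A real harness additionally checks that previously passing tests still pass, so a patch cannot "resolve" an issue by breaking unrelated behavior.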

As of March 15th, 2024, the best reported system is Devin, achieving 13.86%. The best entry on the official leaderboard is Claude 2 + BM25 Retrieval, at 1.96%.

This question is part of the AI Benchmarks series by the AI Safety Student Team at Harvard, evaluating AI models against technical benchmarks. Full list of questions:

