AI resolves at least X% on SWE-bench (Assisted) by 2025?
X=5: 98.4%
X=10: 92%
X=20: 72%
X=40: 32%
X=80: 20%
SWE-bench is a benchmark developed to evaluate whether language models can resolve real-world GitHub issues; each instance in SWE-bench corresponds to one GitHub issue. The leaderboard lists models by the percentage of SWE-bench instances they resolve and is split into two categories: Unassisted and Assisted.
Assisted: models are evaluated under the "oracle" retrieval setting, in which the correct files to edit are given to them directly.
This question is only about the Assisted category of the benchmark.
http://www.swebench.com/#
The current SOTA is 4.8%.
The market will resolve based on the SWE-bench leaderboard standings as of 31 December 2024.
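The resolution rule above can be sketched in a few lines: each "X=N" option resolves YES if the best Assisted score on the leaderboard at close meets or exceeds N%. This is a minimal illustrative sketch, not code from swebench.com; the function name and structure are assumptions.

```python
# Hypothetical sketch of how this market's options resolve, given the
# top "Assisted" (oracle-retrieval) resolved-percentage on the
# SWE-bench leaderboard at close. Illustrative only.

THRESHOLDS = [5, 10, 20, 40, 80]  # the X values offered by this market


def resolve_options(best_assisted_pct: float) -> dict:
    """Resolve each 'X=N' option YES if the SOTA meets or beats N%."""
    return {x: ("YES" if best_assisted_pct >= x else "NO")
            for x in THRESHOLDS}


# With the 4.8% SOTA quoted in the description, every option is still NO:
print(resolve_options(4.8))
```

Note that the thresholds are inclusive here (a score of exactly 20% resolves X=20 YES); the market description does not spell this out, so that reading is an assumption.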
Related questions
Will an autonomous agent resolve 90% of tasks on SWE-bench by 2026? (53% chance)
Will an autonomous agent resolve 90% of tasks on SWE-bench by 2027? (69% chance)
What will be the best score on the SWE-Bench (unassisted) benchmark before 2025? (21% chance)
Will an AI SWE model score higher than 50% on SWE-bench in 2024? (15% chance)
AI equivalent of IPCC before 2027? (29% chance)
Will AI pass the Winograd schema challenge by the end of 2025? (86% chance)
Will OpenAI's next major LLM (after GPT-4) achieve over 50% resolution rate on the SWE-bench benchmark? (22% chance)
Will we solve AI alignment by 2026? (7% chance)
Will AI pass the Winograd schema challenge by the end of 2024? (78% chance)
Will I believe my prediction about AI enabling more SWEs to solve less lucrative problems by shrinking team sizes to have been fulfilled by EoY 2030? (68% chance)