Will OpenAI's next major LLM (after GPT-4) achieve over 50% resolution rate on the SWE-bench benchmark?
17% chance

Background: SWE-bench is a benchmark that evaluates the ability of language models to solve real-world software issues from GitHub. It consists of 2,294 Issue-Pull Request pairs drawn from 12 popular Python repositories. Models are tasked with generating patches that resolve these issues, and each patch is verified against unit tests to assess the correctness of the solution, making the benchmark a key measure of practical software-engineering capability.
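The resolution rate referenced throughout this question is simply the fraction of benchmark instances whose generated patch passes the verifying unit tests. A minimal sketch, with hypothetical instance IDs and outcomes (not real benchmark data):

```python
# Illustrative sketch of how a SWE-bench-style resolution rate is computed.
# Instance IDs and pass/fail outcomes below are hypothetical examples.

def resolution_rate(results: dict[str, bool]) -> float:
    """Fraction of instances whose generated patch passed the unit tests."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical per-instance outcomes: True = patch passed the verifying tests.
results = {
    "django__django-11099": True,
    "sympy__sympy-18532": False,
    "requests__requests-2317": True,
    "flask__flask-4045": False,
}

print(f"Resolution rate: {resolution_rate(results):.1%}")  # 2 of 4 -> 50.0%
```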

Question: Will the next major release of an OpenAI LLM achieve over 50% resolution rate on the SWE-bench benchmark?

Resolution Criteria: For this question, the "next major release of an OpenAI LLM" is defined as the next model from OpenAI that satisfies at least one of the following criteria:

  1. It is consistently called "GPT-4.5" or "GPT-5" by OpenAI staff members.

  2. It is estimated to have been trained using more than 10^26 FLOP according to a credible source.

  3. It is considered to be the successor to GPT-4 according to more than 70% of my Twitter followers, as revealed by a Twitter poll (if one is taken).

This question will resolve to "YES" if this LLM demonstrates a resolution rate of more than 50.0% on the SWE-bench benchmark under standard testing conditions. The results must be verified by a credible public release or publication from OpenAI detailing the model's performance on this benchmark. This will resolve according to the first published results detailing the LLM's performance on the benchmark, regardless of any future improvements due to additional post-training enhancements.
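For concreteness, "more than 50.0%" on the full 2,294-instance set means strictly more than 1,147 resolved instances, i.e. at least 1,148. A quick arithmetic check:

```python
import math

TOTAL_INSTANCES = 2294  # Issue-Pull Request pairs in the full SWE-bench set
THRESHOLD = 0.50        # the question requires strictly more than 50.0%

# Smallest resolved count that is strictly above 50.0% of 2,294 instances:
min_resolved = math.floor(TOTAL_INSTANCES * THRESHOLD) + 1
print(min_resolved)                    # 1148
print(min_resolved / TOTAL_INSTANCES)  # ~0.5004, just over the bar
```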


"under standard testing conditions" this would include arbitrary scaffolding as long as OpenAI is the one publishing the result, correct?

sold Ṁ38 NO

I had to sell out on this. Goodhart's Law will probably strike again. Fortunately, the SWE-bench folks should be able to pull a new set of Issue-Pull Request pairs, or create a holdout set.

bought Ṁ20 NO

nice eval. do you have a link that you'd use to resolve this for GPT-4? couldn't find one. i went off this: https://www.swebench.com/
