Background: SWE-bench is a benchmark designed to evaluate the ability of language models to solve real-world software issues from GitHub. It consists of 2,294 Issue-Pull Request pairs drawn from 12 popular Python repositories. Models are tasked with generating patches that resolve these issues, and correctness is verified by running each repository's unit tests. The benchmark is a key measure of language models' capability on practical software engineering tasks.
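For readers who want to inspect the task instances themselves, here is a minimal sketch using the Hugging Face `datasets` library. The dataset name `princeton-nlp/SWE-bench` and the field names (`repo`, `problem_statement`, `patch`) are assumptions based on the public release and may differ between dataset versions.

```python
# Minimal sketch: inspecting SWE-bench task instances via the Hugging Face
# `datasets` library. Assumes the dataset is published as
# "princeton-nlp/SWE-bench"; field names may differ between versions.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))  # number of Issue-Pull Request task instances

example = swe_bench[0]
print(example["repo"])               # source repository of the issue
print(example["problem_statement"])  # the GitHub issue text given to the model
print(example["patch"])              # the reference (gold) patch that resolved it
```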
Question: Will the next major release of an OpenAI LLM achieve over 50% resolution rate on the SWE-bench benchmark?
Resolution Criteria: For this question, the "next major release of an OpenAI LLM" is defined as the next model from OpenAI that satisfies at least one of the following criteria:
It is consistently called "GPT-4.5" or "GPT-5" by OpenAI staff members.
It is estimated to have been trained using more than 10^26 FLOP, according to a credible source.
It is considered the successor to GPT-4 by more than 70% of my Twitter followers, as revealed by a Twitter poll (if one is taken).
This question will resolve "YES" if this LLM demonstrates a resolution rate of more than 50.0% on the SWE-bench benchmark under standard testing conditions. The result must be verified by a credible public release or publication from OpenAI detailing the model's performance on the benchmark. The question resolves according to the first such published result, regardless of any later improvements from additional post-training enhancements.
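To make the threshold concrete, here is a small sketch of the check (the numbers are illustrative, not real results): since 1,147 of 2,294 is exactly 50.0%, a model would need to resolve at least 1,148 instances for this question to resolve "YES".

```python
# Minimal sketch of the resolution check described above: a model's score is
# the fraction of the 2,294 task instances whose unit tests pass after
# applying its generated patch. Numbers here are illustrative only.
TOTAL_INSTANCES = 2294

def resolves_yes(num_resolved: int, total: int = TOTAL_INSTANCES) -> bool:
    """True iff the resolution rate strictly exceeds 50.0%."""
    return num_resolved / total > 0.500

# 1147/2294 is exactly 50.0%, so at least 1,148 resolved instances are needed.
print(resolves_yes(1147))  # False
print(resolves_yes(1148))  # True
```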
Nice eval. Do you have a link you'd use to resolve this for GPT-4? I couldn't find one; I went off this: https://www.swebench.com/