Will OpenAI's next major LLM (after GPT-4) surpass 70% accuracy on the GPQA benchmark?

2026

98% chance

Background: The GPQA (Graduate-Level Google-Proof Q&A Benchmark) is designed to evaluate the capabilities of Large Language Models (LLMs) in answering complex, expert-level multiple-choice questions across disciplines such as biology, physics, and chemistry. This benchmark challenges models with questions that require deep understanding and cannot be solved through simple web searches, reflecting real graduate-level knowledge.

Question: Will the next major release of an OpenAI LLM surpass 70% accuracy on the GPQA benchmark?

Resolution Criteria: For the purpose of this question, the "next major release of an OpenAI LLM" is the next model from OpenAI that satisfies at least one of the following criteria:

It is consistently called "GPT-4.5" or "GPT-5" by OpenAI staff members.
It is estimated to have been trained using more than 10^26 FLOP according to a credible source.
It is considered to be the successor to GPT-4 according to more than 70% of my Twitter followers, as revealed by a Twitter poll (if one is taken).

This question will resolve to "YES" if the next major release of an OpenAI LLM achieves an accuracy exceeding 70% on the GPQA benchmark using any method, as documented in the first credible public release or publication from OpenAI reporting the model's performance statistics.
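As a rough illustration, the resolution rule above can be written as a small check. This is only a sketch of how the criteria combine: the `Candidate` class and its field names are hypothetical, and treating "more than" as a strict inequality is an assumption about how the thresholds would be read.

```python
from dataclasses import dataclass
from typing import Optional

FLOP_THRESHOLD = 1e26   # "more than 10^26 FLOP"
POLL_THRESHOLD = 0.70   # "more than 70% of my Twitter followers"
GPQA_THRESHOLD = 0.70   # "exceeding 70%" accuracy


@dataclass
class Candidate:
    # Hypothetical fields, named for illustration only.
    called_gpt45_or_gpt5: bool                # consistently so named by OpenAI staff
    estimated_training_flop: Optional[float]  # None if no credible estimate exists
    poll_successor_share: Optional[float]     # None if no poll was taken


def is_next_major_release(m: Candidate) -> bool:
    """A model qualifies if it meets at least one of the three criteria."""
    by_name = m.called_gpt45_or_gpt5
    by_compute = (m.estimated_training_flop is not None
                  and m.estimated_training_flop > FLOP_THRESHOLD)
    by_poll = (m.poll_successor_share is not None
               and m.poll_successor_share > POLL_THRESHOLD)
    return by_name or by_compute or by_poll


def resolves_yes(m: Candidate, gpqa_accuracy: float) -> bool:
    """True only when a qualifying model's reported accuracy exceeds 70%."""
    return is_next_major_release(m) and gpqa_accuracy > GPQA_THRESHOLD
```

Under this reading, a model that scores well on GPQA but meets none of the three criteria does not trigger a "YES" resolution.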

More details:

The GPQA consists of 448 expert-crafted questions on which domain experts reach 65% accuracy (74% when adjusted for clear errors). Highly skilled non-expert validators, even with unrestricted web access, reach only 34% accuracy, highlighting the difficulty of the questions.
GPT-4 achieved only 39% accuracy in the original study, although Claude 3 Opus was able to achieve 59.5% when using Maj@32 averaged over 10 iterations.
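For readers unfamiliar with the notation, Maj@32 means sampling 32 answers per question and scoring the most common one. Below is a minimal sketch of that scoring, averaged over repeated runs. The `sample_answer` callable and the `(question, gold_answer)` pair format are hypothetical stand-ins, not any lab's actual evaluation harness.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(sample_answer: Callable[[str], str],
                      questions: List[Tuple[str, str]],
                      k: int) -> float:
    """Fraction of questions whose majority answer over k samples is correct."""
    correct = 0
    for question, gold in questions:
        votes = [sample_answer(question) for _ in range(k)]
        if majority_vote(votes) == gold:
            correct += 1
    return correct / len(questions)


def averaged_maj_at_k(sample_answer: Callable[[str], str],
                      questions: List[Tuple[str, str]],
                      k: int = 32,
                      iterations: int = 10) -> float:
    """Mean Maj@k accuracy over several independent runs."""
    runs = [maj_at_k_accuracy(sample_answer, questions, k)
            for _ in range(iterations)]
    return sum(runs) / len(runs)
```

Because the resolution criteria allow "any method", a score obtained with this kind of voting would count just as a single-sample score would.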

OpenAI · Technical AI Timelines · GPT-5 · LLMs

3 Comments

This should resolve YES from GPT-5

o1 gets 78.3% on the GPQA. This is resolved.

@PhillipBallardsoftclone It does not count as a major release by the standard defined here. I think GPT-4o might if there were a Twitter poll just asking whether it's the successor to GPT-4, though I'm not certain, and I don't think it meets the spirit of the criteria either tbh.