Will OpenAI's next major LLM (after GPT-4) surpass 74% accuracy on the GPQA benchmark?
85% chance

Background: The GPQA (Graduate-Level Google-Proof Q&A Benchmark) is designed to evaluate the capabilities of Large Language Models (LLMs) in answering complex, expert-level multiple-choice questions across disciplines such as biology, physics, and chemistry. This benchmark challenges models with questions that require deep understanding and cannot be solved through simple web searches, reflecting real graduate-level knowledge.

Question: Will the next major release of an OpenAI LLM surpass 74% accuracy on the GPQA benchmark?

Resolution Criteria: For the purpose of this question, the "next major release of an OpenAI LLM" is the next model from OpenAI that satisfies at least one of the following criteria:

  1. It is consistently called "GPT-4.5" or "GPT-5" by OpenAI staff members.

  2. It is estimated to have been trained using more than 10^26 FLOP according to a credible source.

  3. It is considered to be the successor to GPT-4 according to more than 70% of my Twitter followers, as revealed by a Twitter poll (if one is taken).

This question will resolve to "YES" if the next major release of an OpenAI LLM released by OpenAI achieves an accuracy rate exceeding 74.0% on the GPQA benchmark using any method, as documented in the first credible public release or publication from OpenAI documenting the model's performance statistics.
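The resolution logic above can be sketched in code. This is only an illustration of how the stated thresholds combine, not an official resolution tool; the function and parameter names are hypothetical, and the inputs (naming, compute estimate, poll result, reported accuracy) are assumed to have been gathered by hand.

```python
from typing import Optional

# Thresholds taken from the resolution criteria; all are strict ("more than" / "exceeding").
FLOP_THRESHOLD = 1e26        # training compute, criterion 2
POLL_THRESHOLD = 0.70        # share of poll respondents, criterion 3
ACCURACY_THRESHOLD = 74.0    # GPQA accuracy in percent


def is_next_major_release(
    called_gpt45_or_gpt5: bool,
    estimated_flop: Optional[float],
    poll_successor_share: Optional[float],
) -> bool:
    """Return True if at least one of the three criteria is satisfied.

    Hypothetical helper: None means "no credible estimate" or "no poll taken".
    """
    by_name = called_gpt45_or_gpt5
    by_compute = estimated_flop is not None and estimated_flop > FLOP_THRESHOLD
    by_poll = (
        poll_successor_share is not None
        and poll_successor_share > POLL_THRESHOLD
    )
    return by_name or by_compute or by_poll


def resolves_yes(qualifies: bool, reported_accuracy_pct: float) -> bool:
    """YES only if the model qualifies and its reported accuracy exceeds 74.0%."""
    return qualifies and reported_accuracy_pct > ACCURACY_THRESHOLD


if __name__ == "__main__":
    # Made-up inputs for illustration only.
    qualifies = is_next_major_release(False, 2e26, None)
    print(qualifies, resolves_yes(qualifies, 74.0), resolves_yes(qualifies, 74.1))
```

Because every comparison is strict, a model reported at exactly 74.0% would not resolve the question to "YES" under this reading.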

More details:

  • GPQA consists of 448 expert-written questions on which domain experts reach 65% accuracy (74% after discounting clear mistakes the experts identified in retrospect). Highly skilled non-expert validators, even with unrestricted web access, reach only 34% accuracy, highlighting the difficulty of the questions.

  • GPT-4 achieved only 39% accuracy in the original study, whereas Claude 3 Opus reached 59.5% using Maj@32 (majority vote over 32 sampled answers), averaged over 10 iterations.
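To make the numbers above concrete, here is a minimal sketch of majority-vote (Maj@k) scoring and of how many questions a model must answer correctly to exceed 74.0%. The helper names are hypothetical, and the count assumes the score is reported on the full 448-question set; a result reported on a subset of GPQA would change the arithmetic.

```python
import math
from collections import Counter
from typing import List, Sequence


def majority_vote(samples: Sequence[str]) -> str:
    """Return the most common answer; ties resolve to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def maj_at_k_accuracy(sampled_answers: List[Sequence[str]], gold: Sequence[str]) -> float:
    """Accuracy in percent when each question is scored by its majority-vote answer."""
    correct = sum(
        majority_vote(samples) == answer
        for samples, answer in zip(sampled_answers, gold)
    )
    return 100.0 * correct / len(gold)


def min_correct_to_exceed(threshold_pct: float, n_questions: int) -> int:
    """Smallest number of correct answers whose accuracy is strictly above the threshold."""
    return math.floor(threshold_pct * n_questions / 100) + 1


if __name__ == "__main__":
    # Toy data: two questions, three sampled answers each (Maj@3 rather than Maj@32).
    samples = [["A", "A", "B"], ["C", "D", "D"]]
    gold = ["A", "C"]
    print(maj_at_k_accuracy(samples, gold))
    print(min_correct_to_exceed(74.0, 448))
```

Under that assumption, 331 correct answers give about 73.9% and 332 give about 74.1%, so 332 of 448 is the smallest count that exceeds the threshold.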

6mo

Does o1 meet your criteria @MatthewBarnett?

9mo

This is the next model at time of release, correct?

6mo

@JacobPfau Yes, quoting from the criteria,

This question will resolve to "YES" if the next major release of an OpenAI LLM released by OpenAI achieves an accuracy rate exceeding 74.0% on the GPQA benchmark using any method, as documented in the first credible public release or publication from OpenAI documenting the model's performance statistics.
