Background: The GPQA (Graduate-Level Google-Proof Q&A Benchmark) is designed to evaluate the capabilities of Large Language Models (LLMs) in answering complex, expert-level multiple-choice questions across disciplines such as biology, physics, and chemistry. This benchmark challenges models with questions that require deep understanding and cannot be solved through simple web searches, reflecting real graduate-level knowledge.
Question: Will the next major release of an OpenAI LLM surpass 70% accuracy on the GPQA benchmark?
Resolution Criteria: For the purpose of this question, the "next major release of an OpenAI LLM" is the next model from OpenAI that satisfies at least one of the following criteria:
It is consistently called "GPT-4.5" or "GPT-5" by OpenAI staff members
It is estimated to have been trained using more than 10^26 FLOP according to a credible source.
It is considered to be the successor to GPT-4 according to more than 70% of my Twitter followers, as revealed by a Twitter poll (if one is taken).
This question will resolve to "YES" if the next major release of an OpenAI LLM released by OpenAI achieves an accuracy rate exceeding 70% on the GPQA benchmark using any method, as documented in the first credible public release or publication from OpenAI documenting the model's performance statistics.
More details:
The GPQA consists of 448 expert-crafted questions where domain experts reach 65% accuracy (74% adjusted for clear errors). Highly skilled validators, even with unrestricted web access, only reach 34% accuracy, highlighting the difficulty and sophistication required.
GPT-4 achieved only 39% accuracy in the original study, although Claude 3 Opus was able to achieve 59.5% when using Maj@32 averaged over 10 iterations.