Background
GPQA is a graduate-level, 448-question multiple-choice benchmark covering biology, chemistry, and physics. It was designed to be “Google-proof”: resistant to web-search shortcuts and challenging even for PhD holders. Domain experts average roughly 65% accuracy, while skilled non-experts reach only about 34% despite unrestricted web access.
Because the GPQA questions are publicly released, leaderboard results come from independent community test harnesses (e.g., Vellum AI, LLM-Stats, xAI livestreams). The highest publicly reported AI score as of July 2025 is 88.4%.
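For reference, such harnesses score GPQA as plain multiple-choice accuracy over the 448 questions. Below is a minimal sketch of that computation, assuming answers have already been parsed to single choice labels; the function and variable names are illustrative, not taken from any particular harness.

```python
# Minimal sketch of the accuracy computation a GPQA harness performs.
# Assumption (not from any specific harness): each question has one gold
# choice label (e.g., "A"-"D") and the model's final answer has been
# parsed to the same label format.

def gpqa_accuracy(model_answers: list[str], gold_answers: list[str]) -> float:
    """Fraction of the 448 questions answered correctly (exact label match)."""
    assert len(model_answers) == len(gold_answers) == 448, "standard GPQA set"
    correct = sum(m.strip().upper() == g.strip().upper()
                  for m, g in zip(model_answers, gold_answers))
    return correct / len(gold_answers)
```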
Resolution criteria
The market resolves to the first year in which ALL of the following conditions hold:
Score threshold – A single, fully autonomous system achieves ≥ 95% average accuracy on the standard 448-question GPQA set (see the worked check after this list).
Verification – The result is confirmed by either
a) a peer-reviewed or widely cited paper (e.g., an arXiv preprint or a NeurIPS publication) that includes full evaluation evidence, or
b) an official public leaderboard entry (e.g., Vellum AI, LLM-Stats, or a GPQA maintainer-run board).
Autonomy – After evaluation starts, no human may alter answers; chain-of-thought may be hidden, but any tool use (e.g., Python, calculators) must be invoked autonomously by the system.
Expiry – If no qualifying run is verified by Jan 1, 2030, the market resolves “Not Applicable.”
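As a worked check on the score threshold: 95% of 448 questions is 425.6, so a qualifying run must answer at least 426 questions correctly (425/448 ≈ 94.9% falls just short).

```python
import math

TOTAL_QUESTIONS = 448  # the standard GPQA main set
THRESHOLD = 0.95       # accuracy required by the resolution criteria

# 0.95 * 448 = 425.6, so the smallest qualifying correct count is 426.
min_correct = math.ceil(THRESHOLD * TOTAL_QUESTIONS)
print(min_correct)                    # 426
print(min_correct / TOTAL_QUESTIONS)  # 0.9508928571428571 (~95.1%)
```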