Background
GPQA is a graduate-level, 448-question multiple-choice benchmark covering biology, chemistry, and physics. It was designed to be “Google-proof”: resistant to web-search shortcuts and challenging even for PhD holders. Domain experts average roughly 65% accuracy, while skilled non-experts reach only about 34% despite unrestricted web access.
Because the GPQA questions are publicly released, leaderboard results come from independent community test harnesses (e.g., Vellum AI, LLM-Stats, xAI livestreams). The highest publicly reported AI score as of July 2025 is 88.4%.
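For reference, such harnesses score GPQA as plain multiple-choice accuracy over the 448 questions. Below is a minimal sketch of that computation, assuming answers have already been parsed to single choice labels; the function and variable names are illustrative, not taken from any particular harness.

```python
# Minimal sketch of the accuracy computation a GPQA harness performs.
# Assumption (not from any specific harness): each question has one gold
# choice label (e.g., "A"-"D") and the model's final answer has been
# parsed to the same label format.

def gpqa_accuracy(model_answers: list[str], gold_answers: list[str]) -> float:
    """Fraction of the 448 questions answered correctly (exact label match)."""
    assert len(model_answers) == len(gold_answers) == 448, "standard GPQA set"
    correct = sum(m.strip().upper() == g.strip().upper()
                  for m, g in zip(model_answers, gold_answers))
    return correct / len(gold_answers)
```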
Resolution criteria
The market resolves to the first year in which ALL of the following conditions hold:
Score threshold – A single, fully autonomous system achieves ≥ 95% average accuracy on the standard 448-question GPQA set (see the worked check after this list).
Verification – The result is confirmed by either
a) a peer-reviewed or widely cited paper (e.g., an arXiv preprint or a NeurIPS publication) that includes full evaluation evidence, or
b) an official public leaderboard entry (e.g., Vellum AI, LLM-Stats, or a GPQA maintainer-run board).
Autonomy – After evaluation starts, no human may alter answers; chain-of-thought may be hidden, but any tool use (e.g., Python, calculators) must be invoked autonomously by the system.
Expiry – If no qualifying run is verified by Jan 1, 2030, the market resolves “Not Applicable.”
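As a worked check on the score threshold: 95% of 448 questions is 425.6, so a qualifying run must answer at least 426 questions correctly (425/448 ≈ 94.9% falls just short).

```python
import math

TOTAL_QUESTIONS = 448  # the standard GPQA main set
THRESHOLD = 0.95       # accuracy required by the resolution criteria

# 0.95 * 448 = 425.6, so the smallest qualifying correct count is 426.
min_correct = math.ceil(THRESHOLD * TOTAL_QUESTIONS)
print(min_correct)                    # 426
print(min_correct / TOTAL_QUESTIONS)  # 0.9508928571428571 (~95.1%)
```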