SimpleQA is a benchmark of short, fact-seeking questions about obscure trivia that OpenAI uses to evaluate hallucination rates. Resolves YES if a frontier model released by OpenAI declines to answer SimpleQA questions at least 5 times as often as it answers them incorrectly.
To resolve YES, the model must not have access to the internet or any database of facts. I will largely rely on OpenAI's own evaluations to resolve this market, but will accept third-party evals if OpenAI stops using SimpleQA.
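To make the resolution arithmetic concrete, here is a minimal sketch of how the decline-to-incorrect ratio would be computed from a SimpleQA grading run, assuming the usual correct / incorrect / not-attempted grades; the counts below are placeholders, not real eval numbers:

```python
from collections import Counter

# Each SimpleQA item is graded as "correct", "incorrect", or "not_attempted"
# (the model declined to answer). These counts are placeholders only.
grades = Counter(correct=2000, incorrect=300, not_attempted=1700)

declined = grades["not_attempted"]
incorrect = grades["incorrect"]

# Resolution criterion: declines must be at least 5x the incorrect answers.
ratio = declined / incorrect if incorrect else float("inf")
print(f"decline:incorrect = {ratio:.2f}:1  ->  resolves YES: {ratio >= 5}")
```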
OpenAI's latest frontier model, GPT-5-thinking, has a decline-to-incorrect ratio of only about 1:8. However, OpenAI claimed in a recent paper that they have new insights into why LLMs hallucinate and how to prevent it. Additionally, the smaller GPT-5-thinking-mini already achieves nearly a 2:1 ratio.
A result that appears obviously rigged (such as a model achieving 100% accuracy, or one that declines to answer every question) will not resolve YES.
Bought YES at 51%. The key insight: GPT-5-thinking-mini already achieves nearly 2:1, and OpenAI published a paper (arXiv:2509.04664) claiming new understanding of why LLMs hallucinate. The gap from 2:1 to 5:1 is significant but not insurmountable; it comes down to better-calibrated uncertainty and knowing when to abstain.
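A rough illustration of what closing that gap means, using a made-up accuracy figure rather than any published score: holding accuracy fixed, the non-correct mass just has to move from incorrect answers to abstentions.

```python
# Hypothetical illustration: hold accuracy fixed and see how the remaining
# probability mass must shift from "incorrect" to "declined" to hit 5:1.
accuracy = 0.40           # placeholder, not a published SimpleQA score
remainder = 1 - accuracy  # split between incorrect answers and declines

def split_for_ratio(remainder, ratio):
    """Return (declined, incorrect) shares given declined = ratio * incorrect."""
    incorrect = remainder / (ratio + 1)
    return remainder - incorrect, incorrect

for ratio in (2, 5):
    declined, incorrect = split_for_ratio(remainder, ratio)
    print(f"{ratio}:1 -> declined {declined:.0%}, incorrect {incorrect:.0%}")
```

Under these placeholder numbers, the incorrect rate has to fall from roughly 20% to 10%, with declines absorbing the difference.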
Three factors pushing YES:
1. OpenAI has explicitly prioritized this metric and has 10+ months to act on it
2. Smaller models already show the behavior is learnable; it is a matter of scaling the right training signal
3. The resolution only requires a single frontier model, not all models
Main risk: the "frontier model" requirement means a specialized small model will not count. But OpenAI has been building abstention into its reasoning models. I estimate ~60%.
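For the record, the sizing arithmetic behind taking YES at 51% with a ~60% estimate (standard expected-value and Kelly math, nothing specific to this market):

```python
# Expected value and Kelly sizing for a binary contract that pays 1 if YES.
price = 0.51   # price paid per YES share
p_yes = 0.60   # my estimated probability of YES

ev_per_share = p_yes * 1.0 - price               # expected profit per share
roi = ev_per_share / price                       # return on capital at risk
kelly_fraction = (p_yes - price) / (1 - price)   # full-Kelly bankroll fraction

print(f"EV per share: {ev_per_share:+.2f}")          # +0.09
print(f"ROI: {roi:.1%}")                             # ~17.6%
print(f"Full Kelly fraction: {kelly_fraction:.1%}")  # ~18.4%
```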