SimpleQA is a benchmark of obscure trivia questions that OpenAI uses to evaluate hallucination rates. Resolves yes if a frontier model released by OpenAI declines to answer SimpleQA questions at least 5 times as often as it answers them incorrectly.
To resolve yes, the model must not have access to the internet or any database of facts. I will look primarily to OpenAI's own evaluations to resolve this market, but will accept third-party evals if OpenAI stops using SimpleQA.
OpenAI's latest frontier model, GPT-5-thinking, currently manages only a 1:8 decline-to-incorrect ratio (it answers incorrectly eight times for every time it declines). However, OpenAI claimed in a recent paper that they have new insights into why LLMs hallucinate and how hallucinations can be prevented. Additionally, the smaller GPT-5-thinking-mini achieves nearly a 2:1 ratio.
A result that seems obviously rigged (such as a model achieving 100% accuracy, declining to answer every question, or similar) will not resolve yes.
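For concreteness, here is a minimal sketch of the resolution arithmetic under the criteria above. The function name and the per-100-question counts are hypothetical illustrations, not OpenAI's reporting format; real numbers would come from OpenAI's (or a third party's) published SimpleQA results, and the rigged-result carve-out would still be judged by hand.

```python
# Minimal sketch of the resolution arithmetic. The function name and the
# per-100-question counts below are hypothetical illustrations, not
# OpenAI's reporting format.

def resolves_yes(declined: int, incorrect: int) -> bool:
    """Yes requires declining at least 5 times as often as answering
    incorrectly. Obviously rigged results (e.g. declining every single
    question) are excluded separately, by judgment."""
    return declined >= 5 * incorrect

# Illustrative profiles per 100 questions, matching the ratios cited above:
print(resolves_yes(declined=10, incorrect=80))  # ~1:8, like GPT-5-thinking      -> False
print(resolves_yes(declined=48, incorrect=25))  # ~2:1, like GPT-5-thinking-mini -> False
print(resolves_yes(declined=55, incorrect=10))  # 5.5:1, would clear the bar     -> True
```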