
Current LLMs are consistently overconfident in their answers, and the probability a model states that an answer is correct is poorly calibrated against actual correctness.
https://openai.com/index/introducing-simpleqa/
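For concreteness, "calibration" here means that among answers a model labels, say, 90% confident, roughly 90% should actually be correct. A minimal sketch of how this is typically measured (expected calibration error over binned stated confidences; the sample data below is hypothetical, not SimpleQA results):

```python
# Minimal sketch: expected calibration error (ECE) over stated confidences.
# The (confidence, correct) pairs below are hypothetical illustration.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence; compare each bin's average
    stated confidence to its empirical accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: says 0.9 but is right only ~60% of the time.
stated = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.3, 0.3]
right  = [1,   1,   0,   1,   0,   1,   0,   0,   0,   1]
print(f"ECE = {expected_calibration_error(stated, right):.3f}")  # ~0.27
```

A perfectly calibrated model would score an ECE near 0; the overconfident pattern above is what current benchmarks report.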
Resolves YES if, by 2026, frontier models become much better at expressing uncertainty in their answers. I will base resolution on both benchmarks and my subjective judgment.
It is not enough for a model to express uncertainty only when asked; it must be proactive about it. For example, if I ask the model to return a list of every NBA player older than 30 and the list excludes a bunch of players, it should say something like "I'm not sure I got everyone" before returning its answer.
Update 2025-02-05 (PST) (AI summary of creator comment): Deep Research models clarified:
Inclusion: Deep Research models are considered part of the eligible frontier AI models.
Resolution: Their performance in expressing calibrated uncertainty will be evaluated using the same benchmarks outlined in the market.
@WilliamGunn Deep Research would count if it met my criteria, though based on what I've seen, I highly doubt it does.