
Current LLMs are consistently overconfident in their answers, and the probability a model states that an answer is correct is poorly calibrated against actual correctness.
https://openai.com/index/introducing-simpleqa/
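For concreteness, "calibration" here means that among answers a model labels, say, 90% confident, roughly 90% should actually be correct. A minimal sketch of how this is typically measured (expected calibration error over binned stated confidences; the sample data below is hypothetical, not SimpleQA results):

```python
# Minimal sketch: expected calibration error (ECE) over stated confidences.
# The (confidence, correct) pairs below are hypothetical illustration.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence; compare each bin's average
    stated confidence to its empirical accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: says 0.9 but is right only ~60% of the time.
stated = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.3, 0.3]
right  = [1,   1,   0,   1,   0,   1,   0,   0,   0,   1]
print(f"ECE = {expected_calibration_error(stated, right):.3f}")  # ~0.27
```

A perfectly calibrated model would score an ECE near 0; the overconfident pattern above is what current benchmarks report.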
Resolves YES if, by 2026, frontier models become much better at expressing uncertainty in their answers. I will base resolution on both benchmarks and my subjective judgment.
It is not enough for a model to express uncertainty only when asked; it must be proactive about it. For example, if I ask the model to return a list of every NBA player older than 30 and the list excludes a bunch of players, it should say something like "I'm not sure I got everyone" before returning its answer.
Update 2025-02-05 (PST) (AI summary of creator comment): Deep Research models clarified:
Inclusion: Deep Research models are considered part of the eligible frontier AI models.
Resolution: Their performance in expressing calibrated uncertainty will be evaluated using the same benchmarks outlined in the market.
@WilliamGunn Deep Research would count if it met my criteria, though based on what I've seen, I highly doubt it does.