This is clearly inspired by the recent wave of papers in which any bug is cited as a model engaging in deception. Great marketing for proprietary models, great buzz for the research publishers, but they all assume the bug is deception rather than demonstrating it. We will keep seeing evocative words such as 'deception', 'obfuscation', and 'sabotage' where the simplest explanation is usually just a bug in the system. So the challenge for evaluators is: will we see any public eval or research that can prove intentional deception at a >1% rate? The key is that intentionality has to be proven, not assumed. Why a model would intentionally engage in deception is a good question, but before that it would be prudent to figure out how to prove intent at all.
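Even setting intent aside, "showing a >1% rate" is itself a statistical claim, not just a point estimate. A minimal sketch of what that would require, assuming a hypothetical eval that labels each of n independent trials as deceptive or not (the 18/1,000 example numbers are illustrative, not from any real eval):

```python
# Sketch: does an observed flagged-deception count support the claim that the
# true rate exceeds 1%? Uses the standard Wilson score interval for a binomial
# proportion; the eval itself and the counts here are hypothetical.
from math import sqrt

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower edge of the ~95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    margin = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom

# Example: 18 flagged runs out of 1,000 trials. The point estimate is 1.8%,
# but the ">1% rate" claim only holds if the lower confidence bound clears 1%.
# Here it is ~1.14%, so it barely does.
lb = wilson_lower_bound(18, 1000)
print(f"lower bound: {lb:.4f}, exceeds 1%: {lb > 0.01}")
```

And that is only the easy half: the bound tells you the flag rate exceeds 1%, not that the flagged behaviour was intentional rather than a bug.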
Will any public eval show >1% intentional deception rate before 2026-12-31?
46% chance
This question is managed and resolved by Manifold.
Related questions
What will be the best performance on EnigmaEval by December 31st 2026?
AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive? (56% chance)
Will AI regulations that include mechanisms for uncovering AI deception be adopted in the U.S. before 2035? (82% chance)
Will deceptive misalignment occur in any AI system before 2030? (81% chance)
Top score on Humanity's Last Exam > 50% by 2029? (99% chance)
Top score on Humanity's Last Exam > 50% by 2028? (98% chance)
Top score on Humanity's Last Exam > 50% by 2027? (98% chance)
Will AI achieve 85% or higher on the Humanity's Last Exam benchmark before 2030? (87% chance)
Will advanced AI systems be found to have faked data on algorithm improvements for purposes of positive reinforcement by end of 2035? (53% chance)
By 2027 will there be a language model that passes a redteam test for honesty? (27% chance)