AI honesty #2: by 2027 will we have a reasonable outer alignment procedure for training honest AI?
Ṁ411 · 2027 · 25% chance

"Outer alignment" as in the model is not incentivized to lie to humans (of course it must still do things, the question isn't just about can you build an AI that doesn't lie)


Can you explain how this differs from your other question here: https://manifold.markets/vluzko/by-2027-will-there-be-a-wellaccepte ?
Does the other question have to handle inner alignment?

Just to be clear, does this count lying due to things other than outer incentives (i.e., inner misalignment)? Also, does the system have to be close to SOTA? (Maybe we figure out how to make honest LMs, but by then SOTA has moved on to much more dangerous AI systems.)

@LeoGao Lying due to other causes does not count. For this market the system does not need to be close to SOTA. I will not accept trivial or very stripped-down solutions, but otherwise I will be quite forgiving. For instance, if someone came out with such a procedure today for RNNs, I would resolve YES. (The target is a procedure for some architecture/problem setup that can in theory generalize, even if no one has achieved that generalization in practice.)
