AI honesty #2: by 2027 will we have a reasonable outer alignment procedure for training honest AI? | Manifold

14

Ṁ1kṀ465

2027

24% chance

"Outer alignment" here means that the model is not incentivized to lie to humans. (Of course, it must still do things; the question isn't just whether you can build an AI that doesn't lie.)

Technical AI Timelines

Technical AI Safety

Mechanistic interpretability

Can you explain what the difference is from your other question here: https://manifold.markets/vluzko/by-2027-will-there-be-a-wellaccepte ?
Does the other question have to handle inner alignment?

Just to be clear, does this count lying due to things other than outer incentives (i.e., inner misalignment)? Also, does the system have to be close to SOTA? (Maybe we figure out how to make honest LMs, but by then our SOTA has moved on to much more dangerous AI systems.)

@LeoGao Lying due to other things does not count. For this market, the system does not need to be close to SOTA. I will not accept trivial or very stripped-down solutions, but otherwise I will be quite forgiving. For instance, if someone came out with this today for RNNs, I would resolve YES. (The target is a procedure for some architecture or problem setup that can in theory generalize, even if no one has achieved that generalization in practice.)

People are also trading

AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

By 2027 will there be a well-accepted training procedure(s) for making AI honest?

AI honesty #4: by 2027, will we have AI that would tell us if it was planning on destroying us (conditional on that being true)?

AI honesty #1: by 2027 will we have AI that doesn't hallucinate random nonsense?

Will there be a well accepted formal definition for honesty in AI by 2027?

Will deceptive misalignment occur in any AI system before 2030?

By 2028, will I believe that contemporary AIs are aligned (posing no existential risk)?

Will Inner or Outer AI alignment be considered "mostly solved" first?

Conditional on their being no AI takeoff before 2050, will the majority of AI researchers believe that AI alignment is solved?

Conditional on their being no AI takeoff before 2030, will the majority of AI researchers believe that AI alignment is solved?
