Ambitious mechinterp is quite unlikely to confidently assess if AIs are deceptively aligned/dangerous in next 10 years

The full question is "Ambitious mechanistic interpretability is quite unlikely[1] to be able to confidently assess[2] whether AIs[3] are deceptively aligned (or otherwise have dangerous propensities) in the next 10 years.", from https://www.lesswrong.com/posts/hc9nMipTXy2sm3tJb/vote-on-interesting-disagreements


I am optimistic about Paul Christiano's work in combination with ambitious, scalable interpretability and mech interp validations.

@firstuserhere I must say I'm simply unaware of a lot of the alignment agendas. I'd be happy to read and discuss any promising ones recommended to me but it's difficult to sift through such a large space

@firstuserhere and the SAE approach (as shown by HAL for the residual stream, Anthropic for MLPs, and other works for attention) makes me further optimistic, despite obvious scaling and other engineering challenges. The possibilities are mind-boggling!
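For readers unfamiliar with the SAE approach mentioned above, here is a minimal, hedged sketch of the idea: train a sparse autoencoder on residual-stream activations to learn an overcomplete, sparse dictionary of features. The layer choice, dimensions, and L1 coefficient below are illustrative assumptions, not taken from any of the cited works.

```python
# Minimal sparse-autoencoder sketch (PyTorch). Sizes and the L1 coefficient are
# illustrative assumptions; real activations would come from a model's residual stream.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))      # non-negative, hopefully sparse features
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # reconstruction error + L1 sparsity penalty on the feature activations
    return torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()

# Usage: `acts` stands in for residual-stream activations of shape (n_tokens, d_model);
# random data is used here only so the sketch runs on its own.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(1024, 512)
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```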

I think it's probable, not merely possible, that we'll be able to tell when a particular behavior or set of behaviors is active in a model, and it's also possible to validate for deceptive behavior.

And I also expect that to happen within 5-6 years, not 10.
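A minimal sketch of what "telling when a particular behavior is active" could look like in practice, assuming one already has activations from prompts labeled as exhibiting or not exhibiting the behavior: fit a linear probe and threshold its confidence. The layer, labels, and threshold here are illustrative assumptions, not a claim about how any specific team does it.

```python
# Hedged sketch: a linear probe over activations as a behavior detector.
# Random data stands in for real labeled activations at some chosen layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

# acts: (n_examples, d_model) activations; labels: 1 = behavior present, 0 = absent
acts = np.random.randn(200, 512)
labels = np.random.randint(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

def behavior_active(new_acts: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    # Flag examples where the probe is confident the behavior is active.
    return probe.predict_proba(new_acts)[:, 1] > threshold

print(behavior_active(np.random.randn(5, 512)))
```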

edit nvm

@jacksonpolack I think that, using mech interp techniques, it could be possible to tell what other sorts of behavior are likely to be active when a given behavior is active, i.e. what other behaviors this behavior promotes.
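A rough sketch of how "what other behaviors this behavior promotes" might be estimated, assuming binary feature activations (e.g., thresholded SAE features) over a dataset: compare how often each feature fires when the target behavior fires against its baseline firing rate. The matrix, target index, and threshold are all illustrative assumptions.

```python
# Hedged co-activation sketch: which features fire unusually often alongside a target feature.
import numpy as np

# active: (n_tokens, n_features) boolean matrix of which features fired on each token
active = np.random.rand(10000, 100) > 0.95
target = 7  # hypothetical index of the behavior we're conditioning on

mask = active[:, target]                      # tokens where the target behavior fired
cooccur = active[mask].mean(axis=0)           # P(feature j active | target active)
baseline = active.mean(axis=0)                # P(feature j active) overall

# Features that fire much more often alongside the target than at baseline.
lift = cooccur / np.maximum(baseline, 1e-9)
promoted = np.argsort(lift)[::-1][:10]
print(promoted, lift[promoted])
```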

> I think that, using mech interp techniques, it could be possible to tell what other sorts of behavior are likely to be active when a given behavior is active, i.e. what other behaviors this behavior promotes.

I think this is a lot less plausible for AI that's as smart as von Neumann than it is for current GPT-4. Predicting what a future von Neumann's philosophical or intellectual innovations are going to be is, well, VN-complete, and we already don't know how to make VN trustworthy! You're likely to end up with a pile of different AIs pitted against each other's misalignment with hacks, incentivized to keep each other in line, that still slowly drifts into 'misalignment'.

@jacksonpolack I think interp work is going to be important and useful in paving the way to a possible solution to "alignment", but by no means am I claiming it'll provide us with a solution to alignment, which is also true for any other approach today.

> we already don't know how to make VN trustworthy!

What we can do is have a human-level aligned researcher, which the Superalignment team at OpenAI proposes, and from there, possibly tackle scaling it to handle OOMs more untrustworthy behaviors that it can evaluate and catalogue.

I think it's plausible it'd assess whether they're deceptively aligned/dangerous in some ways, but not in most/all ways, which is the important bit.

@jacksonpolack I understand what you likely mean here, but it might be helpful to expand on why you think it'll be hard to capture a huge number of possible ways once you have the ability to detect/assess/predict/steer some behaviors.
