Ambitious mechinterp is quite unlikely to confidently assess if AIs are deceptively aligned/dangerous in next 10 years

The full question is "Ambitious mechanistic interpretability is quite unlikely[1] to be able to confidently assess[2] whether AIs[3] are deceptively aligned (or otherwise have dangerous propensities) in the next 10 years.", from https://www.lesswrong.com/posts/hc9nMipTXy2sm3tJb/vote-on-interesting-disagreements


I am optimistic about Paul Christiano's work in combination with ambitious, scalable interpretability and mech interp validations.

@firstuserhere I must say I'm simply unaware of a lot of the alignment agendas. I'd be happy to read and discuss any promising ones recommended to me but it's difficult to sift through such a large space

@firstuserhere and the SAE approach (as shown by HAL for the residual stream, Anthropic for MLPs, and other works for attention) makes me further optimistic, despite obvious scaling and other engineering challenges. The possibilities are mind-boggling!
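For readers unfamiliar with the SAE approach mentioned above, here is a minimal, hedged sketch of the idea: train a sparse autoencoder on residual-stream activations to learn an overcomplete, sparse dictionary of features. The layer choice, dimensions, and L1 coefficient below are illustrative assumptions, not taken from any of the cited works.

```python
# Minimal sparse-autoencoder sketch (PyTorch). Sizes and the L1 coefficient are
# illustrative assumptions; real activations would come from a model's residual stream.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # feature coefficients -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))      # non-negative, hopefully sparse features
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # reconstruction error + L1 sparsity penalty on the feature activations
    return torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()

# Usage: `acts` stands in for residual-stream activations of shape (n_tokens, d_model);
# random data is used here only so the sketch runs on its own.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(1024, 512)
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```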

I think it's probable, not merely possible, that we'll be able to tell when a particular behavior or set of behaviors is active in a model, and it's also possible to validate for deceptive behavior.

And I also expect that to happen within 5-6 years, not 10.
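A minimal sketch of what "telling when a particular behavior is active" could look like in practice, assuming one already has activations from prompts labeled as exhibiting or not exhibiting the behavior: fit a linear probe and threshold its confidence. The layer, labels, and threshold here are illustrative assumptions, not a claim about how any specific team does it.

```python
# Hedged sketch: a linear probe over activations as a behavior detector.
# Random data stands in for real labeled activations at some chosen layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

# acts: (n_examples, d_model) activations; labels: 1 = behavior present, 0 = absent
acts = np.random.randn(200, 512)
labels = np.random.randint(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

def behavior_active(new_acts: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    # Flag examples where the probe is confident the behavior is active.
    return probe.predict_proba(new_acts)[:, 1] > threshold

print(behavior_active(np.random.randn(5, 512)))
```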

edit nvm

@jacksonpolack I think that, using mech interp techniques, it could be possible to tell what other sorts of behavior are likely to be active when a given behavior is active, i.e. what other behaviors this behavior promotes.
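A rough sketch of how "what other behaviors this behavior promotes" might be estimated, assuming binary feature activations (e.g., thresholded SAE features) over a dataset: compare how often each feature fires when the target behavior fires against its baseline firing rate. The matrix, target index, and threshold are all illustrative assumptions.

```python
# Hedged co-activation sketch: which features fire unusually often alongside a target feature.
import numpy as np

# active: (n_tokens, n_features) boolean matrix of which features fired on each token
active = np.random.rand(10000, 100) > 0.95
target = 7  # hypothetical index of the behavior we're conditioning on

mask = active[:, target]                      # tokens where the target behavior fired
cooccur = active[mask].mean(axis=0)           # P(feature j active | target active)
baseline = active.mean(axis=0)                # P(feature j active) overall

# Features that fire much more often alongside the target than at baseline.
lift = cooccur / np.maximum(baseline, 1e-9)
promoted = np.argsort(lift)[::-1][:10]
print(promoted, lift[promoted])
```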

> I think that, using mech interp techniques, it could be possible to tell what other sorts of behavior are likely to be active when a given behavior is active, i.e. what other behaviors this behavior promotes.

I think this is a lot less plausible for AI that's as smart as von Neumann than it is for current GPT-4. Predicting what a future von Neumann's philosophical or intellectual innovations are going to be is, well, VN-complete, and we already don't know how to make VN trustworthy! You're likely to end up with a pile of different AIs pitted against each other's misalignment with hacks, incentivized to keep each other in line, that still slowly drifts into 'misalignment'.

@jacksonpolack I think interp work is going to be important and useful in paving the way to a possible solution to "alignment", but by no means am I claiming it'll provide us with a solution to alignment, which is also true for any other approach today.

> we already don't know how to make VN trustworthy!

What we can do is have a human-level aligned researcher, which the Superalignment team at OpenAI proposes, and from there, possibly tackle scaling it to handle OOMs more untrustworthy behaviors that it can evaluate and catalogue.

I think it's plausible it'd assess whether they're deceptively aligned/dangerous in some ways, but not in most/all ways, which is the important bit.

@jacksonpolack I understand what you likely mean here, but it might be helpful to expand on why you think it'll be hard to capture a huge number of possible ways once you have the ability to detect/assess/predict/steer some behaviors.
