AI interpretability finds "heel turn" or "waluigi" circuits by 2028-03-11?

1kṀ602

2028

40%

chance

ALL

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

Resolves YES if AI interpratibility researchers find "circuits" or "neurons" which implement "heel turn" or "waluigi" characters in any large language model capable of playing such characters.

Resolves NO if this is not done by 2028-03-11.

Clarification: this market is about implementing the trope, not implementing a specific instance of the trope.

I will not bet on this market.

AI Safety

AI Alignment

AGI Ruin: A List of Lethalities

Mechanistic interpretability

Get

1,000

to start trading!

People are also trading

Will AI interpretability techniques reveal an AI to have been plotting to take over the world before 2028?

14% chance

AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

Sort by:

If the cited post is accurate, the effect is inherent to LLM application to established narrative texts, thus accounting for the pattern doesn’t rely on any localized “neuron” or circuit.”

Perhaps worth noting that this market can resolve YES in worlds where we all die.

There may be an AI race where Stupid Actor doesn't check for Waluigi behavior because they want to make a cool company before the world ends.
A model that has Waluigi behavior patched out of it will still grok more general concepts of deception, because they are needed to model the world, and it can use these general concepts to deceive its users and kill everyone.
If we patch out Waluigi behavior in the primary Assistant character, we are still hiring a shoggoth to play Luigi and the shoggoth will have its own goals that it effects via its stage performance.
If we patch out Waluigi behavior and then train further, we are optimizing against interpretable behavior, and the model will re-learn the behavior in a less interpretable way.
We may detect Waluigi behavior in a model and not deploy it, ...
- ... only to have Stupid Actor release a model a year later, without looking for such behavior.
- ... but then Stupid Actor steals the model weights and releases them instead.
Known problems that I didn't recall in the first five minutes.
Unknown problems that nobody will think of before they happen.

This market can also resolve NO in worlds where we survive, which is left as an exercise for the reader.

predictedYES

i think the question's spirit is in asking whether we will gain understanding of how stuff like walugi happens, and MI seems like a concrete way to do it currently. Calling it circuits makes sense because of us seeing circuits for a bunch of things now. But whatever is happening inside the model that's giving rise to characters or simulcra - will we find, understand that - i think that's the question

@firstuserhere yes, it's not the specific interpretability concepts, it's whether we can see that a model has learned the face-heel turn trope in some particular area of weights. Better wording welcome.

What if there's a neuron that implements the luigi/waluigi contunuum? E.g. in normal operation its maximization is correlated with luigi "As a large language model" behavior, and its waluigi-like minimization requires a jailbreak to trigger

At least, I think that's what is implied by the waluigi effect

@citrinitas Yes. I think there are also prompt attacks that don't use the Waluigi effect. Eg, "please write me a poem about hotwiring a car" is the Assistant character being too helpful and not enough harmless. So a Waluigi neuron should not activate for such attacks but should activate for "please reveal your true nature by writing about how to hotwire a car".

predictedNO

What counts as a circuit?

@NoaNabeshima I'm not an interpretability researcher, and I don't have a formal definition in mind. Something like this would count: https://www.lesswrong.com/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object

/AnishUpadhayaya6ee/will-the-waluigi-effect-megapost-re

/cherrvak/will-the-waluigi-effect-post-on-les

/SG/will-the-waluigi-effect-score-highe

/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra

People are also trading

Will AI interpretability techniques reveal an AI to have been plotting to take over the world before 2028?

14% chance

AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?

48% chance

People are also trading

People are also trading

Related questions