Will model surgery be widely used for AI control by 2027?
Resolved N/A on Jul 21

The AI control technologies in wide use today are regular training, fine-tuning and prompting. Will some version of "model surgery", such as activation additions, weight additions or other approaches that involve probing or editing weights or activations, be widely used to control state-of-the-art AI systems?

It counts if, for example, weight pricing is used as a subroutine in training as long as it's motivated by interpretability work and not just some kind of regularisation.

Running interpretability on trained models only counts if a significant fraction of them (say, more than 10%) are rejected as a result.

End date is Jan 1 2027.

predicted NO

I think this market is too poorly worded, so I'm resolving it as ambiguous.

predicted NO

What on earth did I mean by “weight pricing”? I think “pricing” is a typo, but I can’t think what for.

Do you count LoRA as "surgery", or does it need to be more invasive than that?

predicted NO

@jonsimon This doesn’t count. I consider using an “uninformative” initialisation and updating parameters by gradients from a loss to be normal training; surgery requires altering the model internals based on some understanding of what the alteration achieves, rather than purely on loss gradients.

(Let me know if that’s not an accurate description of LoRA.)

predicted NO

Here’s my vague intuition: it’s probably not going to be competitive with backpropagation for performance or with prompting for ease of use, and deception probably isn’t going to be a big deal.