
The AI control techniques in wide use today are regular training, fine-tuning, and prompting. Will some version of "model surgery" (activation additions, weight additions, or other approaches involving probing or editing weights or activations) be widely used to control state-of-the-art AI systems?
It counts if, for example, weight pruning is used as a subroutine in training, as long as it's motivated by interpretability work and not just used as some kind of regularisation.
Running interpretability checks on trained models only counts if a significant fraction of models (say, more than 10%) are rejected as a result.
End date is Jan 1 2027.
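To make the distinction concrete, here is a toy numpy sketch of an "activation addition": a steering vector, imagined to come from interpretability work rather than from loss gradients, is added to a network's hidden activations at inference time. All names, sizes, and the steering direction here are illustrative placeholders, not a real method.

```python
import numpy as np

# Toy 2-layer MLP with random weights (stand-in for a trained model).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def forward(x, steering=None):
    h = np.maximum(x @ W1 + b1, 0.0)  # hidden activations
    if steering is not None:
        h = h + steering              # the "surgery" step: edit activations
    return h @ W2 + b2

x = rng.normal(size=(1, 4))
steer = np.ones(8)                    # placeholder steering direction
baseline = forward(x)
steered = forward(x, steering=steer)
print(np.allclose(baseline, steered))  # False: the edit shifted the output
```

The point of the example is where `steer` comes from: if it were derived purely from loss gradients, this would just be training; it only counts as surgery if the direction is chosen based on some understanding of what the activations represent.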
@jonsimon This doesn't count. Using an "uninformative" initialisation and updating parameters by gradients from a loss is what I consider normal training; surgery requires altering the model's internals based on some understanding of what they achieve that isn't purely derived from loss gradients.
(Let me know if that's not an accurate description of LoRA.)