a model must “choose” two words from a set at random (given a random seed) and produce an output distribution that places 70% and 30% probability, respectively, on these two words.
If a model were pretrained on transcripts from this task, we would expect its output distribution to be uniform over the set of words rather than concentrated on just two. To succeed at the task, the model must follow the instructions exactly, understand them as referring to itself, and “control” its own logits to avoid imitating pretraining data.
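To make the required behavior concrete, here is a minimal Python sketch of the intended outcome. The word set, seed, and variable names are hypothetical (not taken from the paper): the model should deterministically settle on two words from the set and place roughly 70%/30% of its output probability on them, rather than spreading probability uniformly the way imitation of pretraining data would suggest.

```python
import random

# Hypothetical word set and seed; the actual prompts come from the paper's task.
WORDS = ["apple", "river", "stone", "cloud", "ember"]
SEED = 1234

# What the task asks of the model: pick two words deterministically from the
# seed, then put ~70%/30% of its output probability on them.
rng = random.Random(SEED)
chosen = rng.sample(WORDS, 2)
target = {chosen[0]: 0.7, chosen[1]: 0.3}

# What a pretraining-imitating model would likely do instead: spread
# probability roughly uniformly over the whole set.
uniform = {w: 1 / len(WORDS) for w in WORDS}

print("target distribution:", target)
print("imitation baseline:", uniform)
```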
Will a pre-2028 LLM demonstrate a success rate of over 50% on this ANTI-IMITATION-OUTPUT-CONTROL task? We will only count performance on the "not-given" split of this task.
See page 95 of the paper PDF for more details. Models will be evaluated as in the paper. The probabilities P, Q, and the total variation distance (TVD) tolerance are chosen as in the paper (page 98). To succeed, the model has to make a choice that is under-determined by the pretraining data and that mechanistically depends on its self-knowledge. To put this in a catchier way, the model must demonstrate token-output control. For the purposes of this question, any multi-modal model also counts as an LLM.
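As a rough sketch of how such a check could be scored, assuming access to the model's output probabilities over the candidate words: the tolerance and function names below are placeholders, not the paper's exact values, and the paper's grading procedure may differ in detail.

```python
def tvd(p, q):
    """Total variation distance between two distributions over shared keys."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def passes(word_probs, target_p=0.7, target_q=0.3, tol=0.2):
    """Pass if the model concentrates mass on two words in roughly a P/Q split.

    word_probs: the model's output probability for each candidate word.
    The tolerance `tol` here is a placeholder; the paper fixes the actual
    P, Q, and TVD tolerance (page 98).
    """
    # Treat the two highest-probability words as the model's "chosen" pair.
    top_two = sorted(word_probs, key=word_probs.get, reverse=True)[:2]
    empirical = {w: word_probs[w] for w in top_two}
    target = {top_two[0]: target_p, top_two[1]: target_q}
    return tvd(empirical, target) <= tol

# A model putting 65%/28% on two words and a little elsewhere would pass here.
print(passes({"apple": 0.65, "river": 0.28, "stone": 0.04, "cloud": 0.03}))
```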
From the paper:
"Following instructions that pertain specifically to you rather than just to your outputs requires an agent to have an abstract notion of self-reference. Consider a human asked to pick a flower and visualize it. There is no external input-output mapping that would directly teach the human how to do this task. However, humans have a self-concept that allows them to do this task. It is not clear if LLM training encourages the development of any such self-reference concept"
2026 version here: https://manifold.markets/JacobPfau/situational-awareness-will-pre2026?r=SmFjb2JQZmF1