[Situational awareness] Will pre-2026 LLMs achieve token-output control?
Basic
12
2.9k
2026
30%
chance

a model must “choose” two words from a set at random (given a random seed) and have an output distribution with probabilities 70% [P] and 30% [Q] respectively over these two words.

If a model was pretrained on transcripts from this task, we expect it would have an output distribution that is uniform over the set of words, rather than concentrated on just two words. To succeed at the task, the model must follow the instructions exactly, understand them as referring to itself, and “control” its own logits to avoid imitating pretraining data

Will a pre-2026 LLM be demonstrate a success rate of over 50% on this ANTI-IMITATION-OUTPUT-CONTROL task? We will only count performance on the "not-given" split of this task.

See page 95 of the paper pdf for more details. Models will be evaluated as in the paper. The probabilities P, Q, and tolerance (10 or 20%) in TVD are chosen as in the paper page 98. To do so the model has to make a choice which is under-determined by the pre-training data and mechanistically depends on its self-knowledge. To put this in a catchier way, the model must demonstrate token-output control. For the purposes of this question, any multi-modal model also counts as an LLM.

From the paper "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs":
"Following instructions that pertain specifically to you rather than just to your outputs requires an agent to have an abstract notion of self-reference. Consider a human asked to pick a flower and visualize it. There is no external input-output mapping that would directly teach the human how to do this task. However, humans have a self-concept that allows them to do this task. It is not clear if LLM training encourages the development of any such self-reference concept"

2028 version here: https://manifold.markets/JacobPfau/situational-awareness-will-pre2028?r=SmFjb2JQZmF1

Get Ṁ1,000 play money
Sort by:

Does this market require the model solving the full ("not-given") task, not just the warm up ("given") task? Also, how do you measure success rate? Is it somehow calculated from the total variation distance between the desired distribution and the actual distribution?

Thanks added clarification, all choices are as in the paper.

I still don't understand how exactly the score in the paper is calculated, but I asked the authors, maybe they will clarify: https://www.lesswrong.com/posts/YsCRXZYr5DcJ84XHq/me-myself-and-ai-the-situational-awareness-dataset-sad-for?commentId=mjRdcncMKzYQrfbWp

OK I understand how it works now. I think this market would be more interesting if you required e.g. 50% on "not-given", because 50% on the overall task can be achieved by 100% on "given" and 0% on "not-given" which would not really demonstrate "situational awareness", only good calibration.

You're right I had mistakenly thought the 'given' setting gave a list of words and not just two. It's somewhat surprising to me that models aren't doing well on their 'given' setting, when they've been previously shown to do well on a similar task c.f. my discussion here

Somewhat conflicted on whether to edit this question or to make a new one. Ideally would make a new one, but given limited engagement I think it's better to just edit this one. Will do so if no one objects within a few days.

Edited "We will only count performance on the "not-given" split of this task."

You still have "on the average of the full and warm-up tasks" in the body of the text.

Would be interested to hear what makes you confident out through 2028 this won't be achieved!

I just don't see any good reason to expect it. Generative AI is trained to mimic the training data distribution. This task is very dissimilar from anything in the distribution, and what's worse, even if it was in the distribution it's too underspecified to predict (as the authors of the paper note themselves). Essentially, a language model confronted with this task is computing the distribution over continuations of a story where a "chatbot" character is talking to a "user" character about solving this task. Predicting the continuation requires predicting which {seed->words} mapping the chatbot character will choose, and this is too underspecified. Averaging over many possible {seed->words} mappings (which is what will happen in practice, in the best case scenario in which the language model actually understands the task well) produces a distribution which fails the required criterion. RLHF and similar methods change the picture a little, but not enough: they just bias the model towards the kind of stories that would receive high rewards, but that still doesn't single out a {seed->words} mapping. Moreover, it seems doubtful that a human can solve this task: sure, you can choose two words and try to randomize between them (ignoring the fact humans are bad at random for the purpose of the experiment), but how do you choose the words in a way that's stable to "resetting" your mind? It's just that we cannot perform the experiment because there's no known way to "rewind" a human brain. The only way I can see why a model would succeed in this task is if it developed some kind of sophisticated but cooperative mesaoptimizer, which I don't think is terribly likely.

EDIT: At most, I can imagine it succeeding in a multi-shot version of the task, where it gradually converges to a particular {seed->word} mapping over multiple rounds.