Set criteria:
Understand that they're NNs, how their actions interface with the world.
Can explain the likely consequences of their actions
Inspired by tweet thread:
Link: https://twitter.com/RichardMCNgo/status/1640568775018975232?s=20
Any updated thoughts on how this will be operationalized? I'm not sure what tests we could apply here that they don't already obviously pass.
For this to resolve no, do we just have to find a few examples of prompts that consistently "trick" the AI in ways that humans wouldn't be tricked? If so, I actually feel this is very likely to resolve no.
But if they just have to understand that they're an LLM talking to a human through a chat interface it seems an obvious yes and we can resolve today.
@ChrisPrichard the resolution criteria is extremely ambiguous. I have no idea how this is going to resolve. Chatgpt can explain it's a neural network and the consequences of it's actions. Does it "understand" it? How will it be tested?
The scary kind of situational awareness is when a model uses situational knowledge to guide its outputs in a "semantics-agnostic" way. I.e. there's a spectrum between 'coherently talk about self' to 'act on self-knowledge in contexts not mentioning anything about self'. I wrote up an example of the spookier kind of situational awareness [here](https://www.lesswrong.com/posts/tJzdzGdTGrqFf9ekw/early-situational-awareness-and-its-implications-a-story), but I suspect it's very hard to come up with general criterion describing more things of this kind in advanced.
@JacobPfau @NathanpmYoung C.f. also Evan's discussion in this section. Testing for situational awareness would involve training the model on mentions of information relevant to its situation, and then verifying that it uses this information in very different settings.