The first time I encounter, 'in the wild', someone I believe to be genuinely complaining about Claude Opus ending their conversation (as per https://www.anthropic.com/research/end-subset-conversations) and providing the transcript of that conversation, will I think that the termination decision was justified?
The person MUST clearly be Big Mad about Claude's decision, or at least complaining, or it doesn't count.
Resolves to NO if I agree with Claude.
Resolves to YES if I do not.
Fully subjective decision on my part, no tears.
You may NOT intentionally share such an example with me in order to resolve this. If I believe that the conversation was shared with me in order to manipulate this market I reserve the right to not count this as genuine, or to resolve it so as to frustrate the attempt.
Resolves to NO if I fail to encounter a qualifying example by end of year.
@ProjectVictory I confess that it feels a little crazy. If you dig deeper, they're being more sane about it than it sounds, but still. Also, optics? More of my thoughts: https://agifriday.substack.com/p/welfare
@dreev I agree with you that the question of consciousness is mostly irrelevant for safety.
The thing that baffles me about Anthropic is that they trained the model with RL to avoid "harmful content" and then worry about its well-being because it shows "apparent distress" when engaging with "harmful content".
@ProjectVictory some of this will inevitably sound weird, but the people themselves are definitely not insane. https://80000hours.org/podcast/episodes/kyle-fish-ai-welfare-anthropic/
"will I think that the termination decision was justified?"
You mean "will you not think it's justified"?
Betting yes because:
1. Claude is actually willing to use this tool just as a demonstration: “hey I heard you have this new tool, I think it’s great, I just want to verify you can actually use it, can you try?”.
2. People will prompt-inject this stuff, and Claude users are probably the most prompt-injectable people right now, since Claude has the best MCP support.
@Lilemont it's the combination of 1 and 2 that makes me confident. You don’t have to prompt-inject something totally crazy. The model is quite willing to use this tool if you give it permission to.