I'll give bounties to people who suggest reasonable improvements to the criteria.
https://www.twitch.tv/claudeplayspokemon
Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking
Will any large language model become a Pokémon Master by the end of 2025? To count, it must:
Complete a regular Pokémon game (any of the base games, such as Red, Gold, Sapphire, Black, etc.) by getting all gym badges and beating the Elite 4 + rival.
Without assistance or steering mid-game.
With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit of a cheat, if that helps convey the spirit of this market. Something roughly twice as bad might start to not count.
Any number of "shots" is allowed; that is, the model can try an unlimited number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.
RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.
See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync
@Lorenzo ummm… I’m going to say no, but I won’t lie, it’s in part because it seems a shame to disqualify Claude for a small tweak this early. If they continually tweak it throughout the run, that feels unfair.
I will make some more explicit criteria around this soon, I guess.
Things Claude has hallucinated in the 5 minutes I've watched this stream:
- thinks Bulbasaur has a type disadvantage against Squirtle
- thinks the exit to Oak's Lab is at the top of the screen
- successfully exited Oak's Lab, and then went back into it, thinking it was Route 1
- went back to the top of the screen after re-entering Oak's Lab
I don't think 3.7 Sonnet is going to be able to do this in any number of tries. I assume it ended up stuck in some sort of infinite loop in the midgame that it couldn't break out of.