I'll give bounties to people who suggest reasonable improvements to the criteria.
https://www.twitch.tv/claudeplayspokemon
Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking
Will any large language model become a Pokémon Master by the end of 2025? To count, it must:
Complete a regular Pokémon game (any of the base games, such as Red, Gold, Sapphire, Black, etc.) by earning all gym badges and beating the Elite Four and the rival.
Without assistance or steering mid-game, meaning help that targets something specific the model is stuck on rather than a general improvement. Tweaks to the system midway through are fine as long as they are in the spirit of general improvements; that is, the LLM should afterwards be able to complete the game end to end without additional changes. This wording is meant to allow small tweaks to Claude Plays Pokemon without invalidating the run. That said, if they become looser with hints and unblocking, it will not count.
With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit of cheating, if that helps convey the spirit of this market. Something roughly twice as bad might start to not count.
Fine-tuning or reinforcement learning specific to Pokemon (or video games in general) is not allowed.
Any number of attempts is allowed; that is, the model can try an unlimited number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.
RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.
See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync
Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:
Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.
Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.
Update 2025-05-02 (PST) (AI summary of creator comment): Based on a discussion about a specific LLM run (Gemini 2.5 Pro beating Pokémon Blue):
The creator agreed that this specific run does not count towards market resolution.
The reason cited was that substantial mid-game changes to the system's structure, such as introducing a separate "strategist" LLM specifically to solve boulder puzzles mid-run, were considered "significantly over the boundary" of allowed mid-game tweaks.
This type of intervention is considered disallowed mid-game assistance specifically targeting blockers, rather than a general system improvement permitted by the rules.
Gemini 2.5 Pro beat Pokémon Blue earlier today. However, I don't think it should count, as too many substantial changes were made to the scaffolding throughout the run, including, for example, allowing the main Gemini to call a separate "strategist" Gemini with a prompt dedicated to solving boulder puzzles in Victory Road.
In general it was deliberately a pretty loose, experimental run. Later runs may count.
@JulianBradshaw agreed. I think it was significantly over the boundary of mid-game tweaks.
I’ll have to think about whether the map-labelling structure itself is too much. Probably not.
@Sketchy I don't know, possibly https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-now-better-than-claude-at-pokemon and https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG ?
Or https://www.twitch.tv/gemini_plays_pokemon/about ?
Possibly worth waiting until it gets closer to finishing, though, as I expect more will be written and it will be easier to decide.
@Lorenzo I don't intend to do a detailed writeup on Gemini beating the game, there isn't too much new to say. Here's my quick take on it: https://www.lesswrong.com/posts/ekF2EDwKyZJNuxBTb/julian-bradshaw-s-shortform?commentId=cHqfKsCWtn5T5H4Tr
@JulianBradshaw Thanks!
@Sketchy I found this summary interesting: https://old.reddit.com/r/ClaudePlaysPokemon/comments/1kdjysi/gemini_beats_pokemon/mqblfeh/
@Lorenzo ummm… I’m going to say no, but I won’t lie, it’s in part because it seems a shame to disqualify Claude for a small tweak this early. If they continually tweak it throughout the run, that feels unfair.
I will make some more explicit criteria around this soon, I guess.
@Lorenzo I updated the criteria in a way that allows for this kind of thing midgame, as long as it's not directly hinting towards specific blockers.
Things Claude has hallucinated in the 5 minutes I've watched this stream:
- thinks Bulbasaur has a type disadvantage against Squirtle
- thinks the exit to Oak's Lab is at the top of the screen
- successfully exited Oak's Lab, and then went back into it, thinking it was Route 1
- went back to the top of the screen after re-entering Oak's Lab
I don't think 3.7 Sonnet is going to be able to do this in any number of tries; I assume it ended up stuck in some sort of infinite loop in the midgame that it couldn't break out of.