Will an LLM become a Pokémon Master by the end of 2025? [READ DESCRIPTION]
2026
59% chance

I'll give bounties to people who suggest reasonable improvements to the criteria.

https://www.twitch.tv/claudeplayspokemon

Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking

Will any large language model become a Pokémon Master by the end of 2025? To count, it must:

  • Complete a regular Pokémon game (any of the base games, such as Red, Gold, Sapphire, or Black) by getting all gym badges and beating the Elite Four and the rival.

  • Without assistance or steering mid-game, meaning help aimed at a specific point where it is stuck rather than general help. Tweaks to the system midway through are fine as long as they are in the spirit of general improvements; that is, the LLM should be able to complete the game end to end afterwards without additional changes. This is meant to allow small tweaks to Claude Plays Pokémon without invalidating the run. That said, if they become looser with hints and unblocking, the run will not count.

  • With minimal non-LLM programmatic assistance. I consider the automatic pathfinding that Claude uses to be slightly cheating, if that helps convey the spirit of this market. Something roughly twice as bad might start to not count.

  • Fine-tuning or reinforcement learning specific to Pokémon (or video games in general) is not allowed.

Any number of attempts is allowed; that is, the model can try an unlimited number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.

RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.

See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync

  • Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:

    • Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.

    • Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.

  • Update 2025-05-02 (PST) (AI summary of creator comment): Based on a discussion about a specific LLM run (Gemini 2.5 Pro beating Pokémon Blue):

    • The creator agreed that this specific run does not count towards market resolution.

    • The reason cited was that substantial mid-game changes to the system's structure, such as introducing a separate "strategist" LLM specifically to solve boulder puzzles mid-run, were considered "significantly over the boundary" of allowed mid-game tweaks.

    • This type of intervention is considered disallowed mid-game assistance specifically targeting blockers, rather than a general system improvement permitted by the rules.


Gemini 2.5 Pro beat Pokémon Blue earlier today. However, I don't think it should count, as too many substantial changes were made to the scaffolding throughout the run, e.g. allowing the main Gemini to call a separate "strategist" Gemini with a prompt dedicated to solving boulder puzzles in Victory Road.

In general it was deliberately a pretty loose, experimental run. Later runs may count.

@JulianBradshaw agreed. I think it was significantly over the boundary of mid-game tweaks.

I’ll have to think about whether the map-labelling structure itself is too much. Probably not.

blackout strat best strat!

Safari Zone is a massive problem: the step count is limited and each attempt costs money. You need a system that is cracked at navigation to get past that.

@SaviorofPlant Gemini cleared safari zone, but not sure if its setup qualifies for this market

@Lorenzo I need to look more into the Gemini setup. Is there anywhere with the best overview?

@Sketchy I don't know, possibly https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-now-better-than-claude-at-pokemon and https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG ?

Or https://www.twitch.tv/gemini_plays_pokemon/about ?

Possibly worth waiting until it gets closer to finishing though, as I expect more will be written and it will be easier to decide


@Lorenzo I don't intend to do a detailed writeup on Gemini beating the game, there isn't too much new to say. Here's my quick take on it: https://www.lesswrong.com/posts/ekF2EDwKyZJNuxBTb/julian-bradshaw-s-shortform?commentId=cHqfKsCWtn5T5H4Tr

Does this count as "assistance or steering mid-game."?

@Lorenzo ummm… I'm going to say no, but I won’t lie, it’s in part because it seems a shame to disqualify Claude for a small tweak this early. If they continually tweak it throughout the run, that feels unfair.

I will make some more explicit criteria around this soon, I guess.

@Lorenzo I updated the criteria in a way that allows for this kind of thing midgame, as long as it's not directly hinting towards specific blockers.

Things Claude has hallucinated in the 5 minutes I've watched this stream:
- thinks Bulbasaur has a type disadvantage against Squirtle
- thinks the exit to Oak's lab is at the top of the screen
- successfully exited Oak's lab, and then went back into it, thinking it was Route 1
- went back to the top of the screen after re-entering Oak's lab

I don't think 3.7 Sonnet is going to be able to do this in any number of tries; I assume it ended up stuck in some sort of infinite loop in the midgame that it couldn't break out of.
