Will an LLM become a Pokémon Master by the end of 2025? [READ DESCRIPTION]
Resolved YES on Jun 10

I'll give bounties to people who suggest reasonable improvements to the criteria.

https://www.twitch.tv/claudeplayspokemon

Anthropic has taken the benchmark world by storm by assessing model performance on Pokémon:

https://www.anthropic.com/news/visible-extended-thinking

Will any large language model become a Pokémon Master by the end of 2025? To count, it must:

  • Complete a regular Pokémon game (any of the base games, like Red/Gold/Sapphire/Black/etc.) by getting all gym badges and beating the Elite 4 + rival.

  • Without assistance or steering mid-game. This means no help targeted at a specific thing the model is stuck on, as opposed to general improvements. Tweaks to the system midway through are fine as long as they are in the spirit of general improvements; that is, the LLM should be able to complete the game end to end afterwards without additional changes. The intent is that small tweaks can be made to Claude Plays Pokemon without invalidating the run. That said, if they become looser with hints and unblocking, the run will not count.

  • With minimal non-LLM programmatic assistance. To give a sense of the spirit of this market: I think the automatic pathfinding that Claude is using is already a little bit of cheating. Something roughly twice as bad might start to not count.

  • Fine-tuning or reinforcement learning specific to Pokemon (or video games in general) is not allowed.

Any number of attempts is allowed; the model can try as many times as it needs. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.

RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.

See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync

  • Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:

    • Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.

    • Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.

  • Update 2025-05-02 (PST) (AI summary of creator comment): Based on a discussion about a specific LLM run (Gemini 2.5 Pro beating Pokémon Blue):

    • The creator agreed that this specific run does not count towards market resolution.

    • The reason cited was that substantial mid-game changes to the system's structure, such as introducing a separate "strategist" LLM specifically to solve boulder puzzles mid-run, were considered "significantly over the boundary" of allowed mid-game tweaks.

    • This type of intervention is considered disallowed mid-game assistance specifically targeting blockers, rather than a general system improvement permitted by the rules.

  • Update 2025-06-04 (PST) (AI summary of creator comment): In response to a question about a specific type of autonomous run with pre-existing scaffolding, the creator confirmed such a run could count and provided these details:

    • A run using pre-existing game-specific scaffolding can count, even if there is "quite a bit" of such scaffolding.

    • The primary condition is that this pre-run scaffolding must not egregiously bypass the need for the LLM to drive itself through the game.

    • The run must be autonomous regarding this pre-existing scaffolding (e.g., no mid-game changes to prompts or tooling, contrasting with previously disallowed mid-run additions of specialized systems).

    • Developer interventions are only permissible if the LLM becomes "hard-stuck due to a system limitation," implying general system fixes rather than specific game hints or unblocking for game-specific challenges.

  • Update 2025-06-04 (PST) (AI summary of creator comment): Regarding specialized sub-systems, such as a 'boulder puzzle solver' LLM:

    • Such a system is acceptable if baked in from the start of the run.

    • This is permissible as long as it falls under game-specific prompting or scaffolding and LLMs are still making the decisions.

    • The critical factor is that the system is pre-existing for the run, contrasting with previous rulings where adding such a specialized system mid-run was disallowed.

  • Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the criterion for minimal non-LLM programmatic assistance (where assistance "roughly twice as bad" as Claude's pathfinding might not count):

    • The creator considered a mapping system that provides the LLM with extensive details, such as all seen tiles, the current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.

    • While acknowledging this is significantly more advanced than Claude's pathfinding, the creator stated they are leaning towards this specific mapping system not being "twice as bad" as Claude's pathfinding, implying it may be acceptable under this rule.

  • Update 2025-06-04 (PST) (AI summary of creator comment): The creator has confirmed that a specific, discussed LLM run setup (referred to as the 'current setup') is included for market resolution, despite being considered 'on the edge'.

This acceptable 'current setup' includes elements such as:

  • A specialized sub-LLM (e.g., for boulder puzzles) that is baked in from the start of the run.

  • An advanced mapping system that provides the LLM with extensive details like all seen tiles, current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.

The creator stated that any further assistance beyond this current configuration will be approached with skepticism.

  • Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the Gemini Plays Pokemon run on Twitch (https://www.twitch.tv/gemini_plays_pokemon):

    • The second run, if it finishes, will count for market resolution.

    • This is contingent on no additional assistance being provided to this run beyond its state at the time of the creator's comment.

    • This specific run is considered on the edge of acceptable criteria.

    • Any further assistance introduced to this run will be viewed with significant skepticism.
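
For readers unfamiliar with the "calculated list of reachable tiles" mentioned in the mapping-system updates above, one plausible reading is a flood fill over the tiles seen so far. The following is only a rough sketch under that assumption, not the code of any actual harness: the grid encoding ("." walkable, "#" blocked, "?" unseen) and the function name `reachable_tiles` are hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical tile encoding (an assumption, not from any real harness):
# "." = walkable, "#" = blocked, "?" = not yet seen.
WALKABLE = "."


def reachable_tiles(grid, start):
    """Return the set of (row, col) tiles reachable from start via 4-way steps.

    Assumes `start` itself is a walkable tile inside the grid.
    """
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and grid[nr][nc] == WALKABLE:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


# Tiny example map: the right-hand column is walled off from the start tile.
grid = [
    "..#.",
    ".##.",
    "...?",
]
print(sorted(reachable_tiles(grid, (0, 0))))
```

A list like this tells the model where it can walk, but it does not choose a destination. That seems to match the creator's distinction between scaffolding that informs the LLM and scaffolding that makes decisions for it.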
