Background

Claude 3.7 Sonnet is currently the only LLM reported to have made significant progress in playing Pokémon Red. Using its "extended thinking" mode, it has defeated several Gym Leaders and progressed through multiple areas of the game. Other LLMs, such as GPT-4V, Gemini, and LLaVA, have been tested but struggled with the spatial reasoning and navigation required to play effectively.

The technical challenges of playing Pokémon Red include maintaining game state awareness, planning multi-step sequences, and navigating the game world effectively.

Resolution Criteria

This market resolves to YES if:

Any LLM other than Claude completes Pokémon Red by defeating the Elite Four and the Champion before any Claude model does so.

This market resolves to NO if either of the following occurs:

Any Claude model (including future versions) completes Pokémon Red by defeating the Elite Four and the Champion first.
No LLM completes Pokémon Red by the market close date.

For resolution purposes:

"Beating Pokémon Red" means completing the main storyline by defeating the Elite Four and the Champion.
The LLM must play autonomously without human assistance beyond initial prompting, scaffolding (tool-use allowed) and setup.
The achievement must be verifiable through credible documentation (video evidence, technical paper, or announcement from a reputable organization).
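
The resolution rules above can be read as a small decision procedure. The following is a minimal sketch in Python, not an official resolution tool: the `Completion` record, its field names, and the `close_date` parameter are hypothetical names introduced for illustration, and the market text does not say how a same-day tie would be handled (here the earliest-listed entry wins).

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Completion:
    """A claimed run that defeated the Elite Four and the Champion (hypothetical record)."""
    model_family: str   # e.g. "Claude", "GPT", "Gemini"
    finished_on: date   # date the Champion was defeated
    autonomous: bool    # no human help beyond prompting, scaffolding, and setup
    verifiable: bool    # backed by video, a technical paper, or a reputable announcement


def resolve(completions: List[Completion], close_date: date) -> str:
    """Return "YES" or "NO" following the criteria as written in the market description."""
    # Only autonomous, verifiable runs finished by the close date count.
    valid = [
        c for c in completions
        if c.autonomous and c.verifiable and c.finished_on <= close_date
    ]
    if not valid:
        return "NO"  # no LLM completed the game in time
    # The earliest valid completion decides the market.
    first = min(valid, key=lambda c: c.finished_on)
    return "NO" if first.model_family.lower() == "claude" else "YES"
```

For example, a verified Gemini run that finishes before any Claude run resolves YES, while an earlier run that relied on human assistance is ignored.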

Considerations

The race to beat Pokémon Red represents a significant AI capability benchmark, as it requires complex reasoning, memory, and planning abilities. While Claude currently has a head start, the field of AI is advancing rapidly, and competitors may develop specialized capabilities to tackle this challenge. Future LLM releases from organizations such as OpenAI, DeepSeek, xAI, Google DeepMind, or others could surpass Claude's current capabilities in game-playing tasks.

2 Comments

>The LLM must play autonomously without human assistance beyond initial prompting, scaffolding (tool-use allowed) and setup.

I'd argue that the Claude Plays Pokémon setup is already giving Claude too much assistance.

I had to juggle all that in my memory when I played, but Claude just has it constantly printed and automatically updated. It also has 25 years of Pokémon Red walkthroughs in its training data, and a pathfinding tool!
If another company tries to beat Anthropic, this could devolve into a who-can-write-the-best-AI-scaffolding contest.

@GG In the limit, you could design an "AI" that simply reads the RNG seed from RAM, then consults a library that lists each of the 256 possible seeds along with an exact, down-to-the-button-press walkthrough of how to beat that seed.
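
To make the degenerate case in the comment above concrete, here is a rough sketch of such a lookup-table "agent" in Python. Everything in it is hypothetical: `build_library`, `lookup_agent`, and the `read_seed` callback are illustrative names, the button scripts are placeholders rather than real walkthroughs, and the one-byte seed simply follows the commenter's figure of 256 possible seeds.

```python
from typing import Callable, Dict, List


def build_library() -> Dict[int, List[str]]:
    """Map each of the 256 assumed seeds to a scripted list of button presses."""
    # Placeholder scripts; a real library would hold a complete run per seed.
    return {seed: ["START", "A"] for seed in range(256)}


def lookup_agent(read_seed: Callable[[], int],
                 library: Dict[int, List[str]]) -> List[str]:
    """Pick the pre-written script for the seed read from emulator RAM."""
    seed = read_seed() & 0xFF  # assumption: the seed fits in a single byte
    return library[seed]       # no reasoning happens; this is pure lookup
```

Such a program contains no model at all, which is why the "scaffolding (tool-use allowed)" clause leaves room for dispute about what counts as the LLM playing.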