Which of these Language Models will beat me at chess?
Any model announced before 2030: 90%
Any model announced before 2028: 82%
Any model announced before 2029: 82%
Any model announced before 2027: 80%
Any open-weights model announced before 2030: 79%
Any model announced before 2026: 57%
GPT-5: 38%
OpenAI o3: 30%
DeepSeek-V4: 30%
Grok 3: 25%
Claude 3.5 Opus: 22%
Llama 4: 16%

Which of these models will beat me at chess once released? Each option resolves YES if the model wins, NO if I win, and 50% for a draw.

I'm rated about 1900 FIDE. When each of these models is released, I'll play a game of chess against it at a rapid time control. On each move, I'll provide the model with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses. Notation ambiguities like Nbd2 vs. Nd2 will not count toward this.
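
A minimal sketch of how such a move loop could be wired up, assuming the python-chess library (illustrative only, not necessarily the actual harness): it builds the per-move prompt from the PGN and FEN, and tallies a reply as illegal only when it can't be played as a legal move, letting ambiguous-but-legal notation through.

```python
# Sketch only: driving one game with the (assumed) python-chess library.
import chess

board = chess.Board()
illegal_moves = 0

def build_prompt(board: chess.Board) -> str:
    # Game so far as PGN movetext plus the current FEN, as described above.
    pgn_so_far = chess.Board().variation_san(board.move_stack)
    return (
        f"We are playing chess. Game so far (PGN): {pgn_so_far}\n"
        f"Current position (FEN): {board.fen()}\n"
        "Reply with your next move in standard algebraic notation."
    )

def play_model_reply(reply: str) -> bool:
    # Returns True if the reply was played; tallies genuinely illegal moves.
    global illegal_moves
    try:
        board.push_san(reply.strip())
        return True
    except chess.AmbiguousMoveError:
        # e.g. "Nd2" when "Nbd2" was needed -- not counted as illegal per the rules
        return False
    except ValueError:
        illegal_moves += 1  # three of these and the model loses
        return False

# Example: after 1. e4 e5, ask the model for its second move.
board.push_san("e4")
board.push_san("e5")
print(build_prompt(board))
```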

Each option will stay open until the model is released, or it will resolve N/A if it becomes clear that the model will never be released. I'll periodically add models that I find interesting to this market. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

  • Update 2025-01-14 (PST) (AI summary of creator comment):

    • Model type: Only general language models are being considered; chess-specific models are excluded.

    • Capabilities: The model must be able to output human languages and code.


Have you tried following this guide on improving LLM chess performance? https://dynomight.net/more-chess/

TL;DR: make it repeat the whole game and give it three small examples of boards and their legal moves before each turn.
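
Roughly what that per-turn prompt could look like, sketched with python-chess (my paraphrase of the idea, not the guide's exact format):

```python
# Sketch: show a few example positions with their legal moves, then have the
# model restate the whole game before picking its move.
import random
import chess

def example_positions(n: int = 3, plies: int = 12) -> str:
    blocks = []
    for _ in range(n):
        b = chess.Board()
        for _ in range(plies):
            moves = list(b.legal_moves)
            if not moves:
                break
            b.push(random.choice(moves))  # short random game for variety
        legal = ", ".join(b.san(m) for m in b.legal_moves)
        blocks.append(f"Position: {b.fen()}\nLegal moves: {legal}")
    return "\n\n".join(blocks)

def build_prompt(board: chess.Board) -> str:
    moves_so_far = chess.Board().variation_san(board.move_stack)
    return (
        "Example positions and their legal moves:\n\n"
        f"{example_positions()}\n\n"
        f"Our game so far: {moves_so_far}\n"
        "First repeat the entire game move by move, then state your next move."
    )
```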

bought Ṁ20 NO

these percents don't make sense

why is a reasoning model less likely to win than 3 non-reasoning models?

bought Ṁ500 NO

Current models are... not close. Illegal moves remain a big problem.
https://www.youtube.com/watch?v=FojyYKU58cw

@AbuElBanat this was published a year ago, before the release of o1. In the game that I played against it, o1 played badly but only made one illegal move.

@mr_mino embarrassing oversight. Thanks.

Would a chess specific model count?

@AdamCzene No, I only plan on adding general LLMs. At a minimum the model should also be able to output human languages and code.

FYI, the latest LLMs are trained on data without chess games, because such specific token data degrades performance on other important tasks.

@mathvc if this were true, wouldn’t you expect them not to be able to play chess at all? How do you explain o1 playing a full game of chess given only FEN and PGN inputs?

I recently played a game against o1, which I won. o1 made several blunders in this game; I'd estimate its Elo to be less than 1000 FIDE. Here is the PGN:

1. e4 e5 2. Nf3 Nc6 3. Bb5 Nf6 4. O-O Nxe4 5. Re1 Nd6 6. Nxe5 Nxe5 7. Rxe5+ Be7 8. d4 Nxb5 9. c4 Nd6 10. c5 Nc4 11. Re2 O-O 12. b3 Na5 13. Nc3 d6 14. Bf4 Bg4 15. Nd5 Bxe2 16. Qxe2 Nc6 17. cxd6 Bxd6 18. Rd1 Re8 19. Bxd6 Rxe2 20. Nxc7 Qxd6 21. Nb5 Rae8 22. Nxd6 Re1+ 23. Rxe1 Rxe1#
