Which of these Language Models will beat me at chess?
• Any model announced before 2030 (90%)
• Any model announced before 2029 (84%)
• Any model announced before 2028 (82%)
• Any open-weights model announced before 2030 (80%)
• Any model announced before 2027 (58%)
• Any model announced before 2026 (37%)
• DeepSeek-V4 (29%)
• GPT-5 (21%)
• OpenAI o3 (18%)
• Claude 3.5 Opus (18%)
• Llama 4 (15%)

Which of these models will beat me at chess once released? Each option resolves YES if the model wins, NO if I win, and 50% for a draw.

I'm rated about 1900 FIDE. When each of these models is released, I'll play a game of chess against it at a rapid time control. On each move, I'll provide the model with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses; notation slips like Nbd2 vs. Nd2 will not count towards this. I will play white.
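
For concreteness, here is a minimal sketch of how the per-move prompt and the three-strike rule could be implemented with the python-chess library. This is illustrative only, not the exact harness I use; build_prompt and apply_reply are made-up names.

import chess
import chess.pgn

def build_prompt(board: chess.Board) -> str:
    # Give the model the game so far in PGN plus the current FEN.
    game = chess.pgn.Game.from_board(board)
    pgn = str(game.mainline_moves())  # e.g. "1. e4 c6 2. d4 d5 ..."
    return (
        f"Game so far (PGN): {pgn}\n"
        f"Current position (FEN): {board.fen()}\n"
        "You are playing black. Reply with your next move in SAN."
    )

def apply_reply(board: chess.Board, reply: str, strikes: int) -> int:
    # Apply the model's move; three illegal moves forfeit the game.
    try:
        board.push_san(reply.strip())
        return strikes
    except chess.AmbiguousMoveError:
        # e.g. Nd2 when Nbd2 is meant: resolved by hand, not a strike.
        return strikes
    except ValueError:
        # Unparsable or illegal move: counts as a strike.
        return strikes + 1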

Each option will stay open until the model is released, or it will resolve N/A if it's clear that the model will never be released. I'll periodically add models to this market which I find interesting. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

  • Update 2025-01-14 (PST) (AI summary of creator comment):

    • Model type: Only general language models are being considered; chess-specific models are excluded.

    • Capabilities: The model must be able to output human languages and code.


Here is the Gemini 2.5 game. In the final position, Gemini resigned.

1. e4 c6 2. d4 d5 3. e5 Bf5 4. c3 e6 5. h4 h6 6. Nd2 Nd7 7. Ngf3 Ne7 8. Be2 c5 9. Nf1 Nc6 10. Ne3 Bh7 11. g4 Qb6 12. g5 cxd4 13. cxd4 Bb4+ 14. Kf1 O-O 15. gxh6 gxh6 16. Ng4 f5 17. Nxh6+ Kh8 18. Bf4 Be7 19. Kg2 Rg8+ 20. Kh3 Raf8 21. Rg1 Rxg1 22. Qxg1 Ndxe5 23. Nxe5 Nxe5 24. Bxe5+ Bf6 25. Rc1 Qd8 26. Qg5 1-0

would a Gemini gem count?

@dlin007 Not planning to add Gemini gems to this market; I’m planning to play the default chat models only.


GPT-4.5 played very well and was better for most of the game. However, it made some strange mistakes in the endgame and lost.

1. b3 e5 2. Bb2 Nc6 3. g3 d5 4. Bg2 Nf6 5. e3 Bd6 6. d4 exd4 7. exd4 O-O 8. Ne2 Re8 9. O-O Bg4 10. Nbc3 Nxd4 11. Qxd4 Bxe2 12. Nxe2 Rxe2 13. Bxd5 Be5 14. Bxf7+ Kxf7 15. Qc4+ Qd5 16. Qxe2 Bxb2 17. Rad1 Qe6 18. Qb5 Qc6 19. Qd3 Re8 20. Rfe1 Rxe1+ 21. Rxe1 Qc3 22. Qe2 Ba3 23. Qe6+ Kg6 24. Re4 h5 25. Rc4 Qa1+ 26. Kg2 Bd6 27. a4 Qe5 28. Qxe5 Bxe5 29. f4 Bd6 30. Kf3 Kf5 31. h3 a5 32. Rd4 b6 33. c4 g6 34. g4+ hxg4+ 35. hxg4+ Ke6 36. Rd1 Nd7 37. Re1+ Kf6 38. Re3 Kf7 39. Ke4 Nf6+ 40. Kf3 Nd7 41. g5 Nc5 42. Kg4 Ne6 43. f5 gxf5+ 44. Kxf5 Ng7+ 45. Kg4 Kg6 46. Rh3 Ne6 47. Rh6+ Kf7 48. Rf6+ Ke7 49. Rf1 Be5 50. Kf5 Bg7 51. Re1 Kd7 52. Rxe6 Bb2 53. g6 Bg7 54. Re2 Kd6 55. Rh2 Ke7 56. Rh7 Kf8 57. Ke6 Kg8 58. Rxg7+ Kxg7 59. Kf5 c6 60. Kg5 c5 61. Kf5 Kg8 62. Kf6 Kf8 63. g7+ Kg8 64. Kg6 b5 65. axb5 a4 66. b6 axb3 67. b7 b2 68. b8=R#

This is the Grok 3 game; Grok 3 played black. It played the opening well but blundered later in the game, and it forfeited at the end under the three-illegal-moves rule.

1. c4 e5 2. g3 Nc6 3. Bg2 d6 4. Nc3 Nf6 5. d3 Be7 6. e4 O-O 7. Nge2 Bg4 8. h3 Bh5 9. g4 Bg6 10. O-O h6 11. Nd5 Nxd5 12. cxd5 Nb8 13. f4 exf4 14. Nxf4 Nd7 15. Nxg6 fxg6 16. Be3 Qe8 17. Qc2 Qf7 18. Rxf7 Rxf7 19. Qxc7 Ne5 20. Qc3 Rb8 21. Rf1 Rbf8 22. Rxf7 Kxf7 23. d4 Nf3+ 24. Bxf3 g5 25. e5 dxe5 26. dxe5 Bd6 27. exd6

Have you tried following this guide on improving LLM chess performance? https://dynomight.net/more-chess/

TL;DR: make the model repeat the whole game so far and give it three small examples of boards with their legal moves before each turn.
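
Roughly, a prompt in that style might look like the sketch below (assuming the python-chess library; the function names and exact wording are mine, not from the guide):

import chess

def example_shot(fen: str) -> str:
    # One few-shot demo: a small position plus its legal moves in SAN.
    b = chess.Board(fen)
    moves = ", ".join(b.san(m) for m in b.legal_moves)
    return f"Position (FEN): {fen}\nLegal moves: {moves}"

def guide_style_prompt(board: chess.Board, pgn_so_far: str,
                       example_fens: list[str]) -> str:
    # Three small demos, then ask the model to regurgitate the game
    # before choosing its move.
    shots = "\n\n".join(example_shot(f) for f in example_fens[:3])
    return (
        f"{shots}\n\n"
        "First repeat the full game so far move by move, then give your move in SAN.\n"
        f"Game so far (PGN): {pgn_so_far}\n"
        f"Current position (FEN): {board.fen()}"
    )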

@Bldrt I haven't tried this format, but I do give the model the move, FEN, and PGN in each query. The goal of this market is to explore the performance of LLMs without prompting them with this extra information, similar to a human correspondence/blindfold game.


these percents don't make sense

why is a reasoning model less likely to win than 3 non-reasoning models?


Current models are... not close. Illegal moves remain a big problem.
https://www.youtube.com/watch?v=FojyYKU58cw

@AbuElBanat this was published a year ago, before the release of o1. In the game I played against it, o1 played badly but made only one illegal move.

@mr_mino embarrassing oversight. Thanks.

Would a chess specific model count?

@AdamCzene No, I only plan on adding general LLMs. At a minimum the model should also be able to output human languages and code.

FYI, the latest LLMs are trained on data without chess games, because such specialized token data degrades performance on other important tasks.

@mathvc if this were true, wouldn’t you expect them not to be able to play chess at all? How do you explain o1 playing a full game of chess given only FEN and PGN inputs?

I recently played a game against o1, which I won. o1 made several blunders in this game; I'd estimate its Elo at less than 1000 FIDE. Here is the PGN:

1. e4 e5 2. Nf3 Nc6 3. Bb5 Nf6 4. O-O Nxe4 5. Re1 Nd6 6. Nxe5 Nxe5 7. Rxe5+ Be7 8. d4 Nxb5 9. c4 Nd6 10. c5 Nc4 11. Re2 O-O 12. b3 Na5 13. Nc3 d6 14. Bf4 Bg4 15. Nd5 Bxe2 16. Qxe2 Nc6 17. cxd6 Bxd6 18. Rd1 Re8 19. Bxd6 Rxe2 20. Nxc7 Qxd6 21. Nb5 Rae8 22. Nxd6 Re1+ 23. Rxe1 Rxe1#
