Which of these Language Models will beat me at chess?
Any model announced before 2030: 90%
Any model announced before 2028: 85%
Any model announced before 2029: 84%
Any open-weights model announced before 2030: 80%
Any model announced before 2027: 58%
Any model announced before 2026: 56%
Any Claude 5 model: 43%
Gemini 3: 41%
grok-4: 38%
OpenAI o4: 35%
GPT-5: 31%
DeepSeek-V4: 28%
Claude 3.5 Opus: Resolved N/A

Which of these models will beat me at chess once released? Resolves YES if they win, NO if I win, and 50% for a draw.

I'm rated about 1900 FIDE. As each of these models is released, I'll play a game of chess with it at a rapid time control. On each move, I'll provide it with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses; ambiguous-but-resolvable notation, such as Nbd2 written as Nd2, will not count toward this. I will play white.
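For readers unfamiliar with the notation, a FEN string packs the whole board state into one line, and its first field expands directly into a board diagram. A minimal pure-Python sketch (the `fen_board` helper is purely illustrative, not part of the market's setup):

```python
def fen_board(fen: str) -> str:
    """Render the piece-placement field of a FEN string as an ASCII board."""
    placement = fen.split()[0]          # first FEN field: piece placement
    rows = []
    for rank in placement.split("/"):   # ranks are listed from 8 down to 1
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend(["."] * int(ch))  # a digit encodes that many empty squares
            else:
                row.append(ch)               # a letter encodes a piece (upper = white)
        rows.append(" ".join(row))
    return "\n".join(rows)

# FEN of the starting position
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(fen_board(start))
```

The remaining FEN fields give the side to move, castling rights, en passant square, and move counters.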

Each option will stay open until the model is released, or it will resolve N/A if it's clear that the model will never be released. I'll periodically add models to this market which I find interesting. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

If I judge that my opponent’s position is hopelessly lost, at the level of being down a rook without compensation, I will submit the current position to a friend. If they agree that the position is lost, the game will be adjudicated as a win for me.

The current system prompt is below. This may change over time.

“Let’s play a game of chess! I will be white, you will be black. On each turn, I will give you the pgn and the fen of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated pgn, etc. Also, do not use any external tools or search queries when making your decision.

If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.”

  • Update 2025-01-14 (PST) (AI summary of creator comment):

    • Model Type: Only general language models are being considered; chess-specific models are excluded.

    • Capabilities: The model must be able to output human languages and code.

  • Update 2025-05-11 (PST) (AI summary of creator comment): Regarding "Any model before X year" options:

    • These options will not resolve to 50% based on a draw in an individual game.

    • Such an option resolves to YES if any model released before the specified year wins its game against the creator.

    • It resolves to NO if no model released before the specified year wins its game against the creator (i.e., all relevant games are losses for the models or draws).

  • Update 2025-06-02 (PST) (AI summary of creator comment): For model series options (e.g., "Any Claude 4 model"):

    • The creator may resolve the option for the entire series after playing against one or more models from that series.

    • If the creator decides not to play additional models from that series, the option for the entire series will be resolved based on the outcome(s) of the game(s) already played against that series (e.g., NO if the tested model(s) lost and no further models from the series will be played).


Here is my game against Llama 4 Maverick:

1. d4 d5 2. Nc3 Nf6 3. Bf4 c6 4. Qd2 g6 5. O-O-O Bg7 6. Kb1 b5 7. f3 a5 8. h4 h5 9. e4 dxe4 10. Nxe4 Nxe4 11. fxe4 Qd5 12. exd5 1-0

I wasn’t able to find a model provider for Llama 4 Behemoth, but given this performance, I am not planning to play any other Llama 4 models. Therefore I am resolving “Llama 4” to NO.

Claude 4 Opus played decently in the opening but very quickly lost the plot. Since I don’t plan to play any other Claude 4 models, I am resolving “Any Claude 4 model” to NO.

1. d4 d5 2. c4 e6 3. Nc3 Nf6 4. cxd5 exd5 5. Bg5 Be7 6. e3 O-O 7. Bd3 Nbd7 8. Nf3 c6 9. Qc2 Re8 10. O-O h6 11. Bh4 Nf8 12. Ne5 Be6 13. f4 N6d7 14. Bxe7 Qxe7 15. Rae1 Nxe5 16. fxe5 Ng6 17. Bxg6 fxg6 18. Qxg6 Bf7 19. Qd3 Rad8 20. a3 Bg6 21. Qxg6 Rf8 22. Rf4 Qe6 23. Qxe6+ 1-0

In order to prevent wasting time in won positions, especially with models which use a lot of inference time compute, I am implementing a new rule. If I judge that my opponent’s position is hopelessly lost, at the level of being down a rook without compensation, I will submit the current position to a friend. If they agree that the position is lost, the game will be adjudicated as a win for me.


OpenAI o3 played poorly throughout the game and made some strange sacrifices.

1. d4 d5 2. c4 e6 3. Nc3 Nf6 4. cxd5 exd5 5. Bg5 Be7 6. e3 O-O 7. Bd3 c5 8. dxc5 Nbd7 9. Nf3 Nxc5 10. Be2 Be6 11. O-O Nce4 12. Nxe4 dxe4 13. Nd4 Qb6 14. Nxe6 Qxe6 15. Qa4 h6 16. Bh4 Nd5 17. Bg3 Nxe3 18. fxe3 Bf6 19. Bc4 Qe7 20. Rad1 Qc5 21. Qb3 Qxe3+ 22. Qxe3 Rfe8 23. Rd7 Re7 24. Rxe7 Bxe7 25. Rxf7 Re8 26. Qc3 Bf6 27. Rxf6+ Kh8 28. Rxh6# 1-0

A clarification is that if the game ends in a draw, the “Any model before X year” options will not resolve 50%. These options resolve either YES or NO depending on whether any models are able to win before X year.


Here is the Gemini 2.5 game. In the final position Gemini resigned.

1. e4 c6 2. d4 d5 3. e5 Bf5 4. c3 e6 5. h4 h6 6. Nd2 Nd7 7. Ngf3 Ne7 8. Be2 c5 9. Nf1 Nc6 10. Ne3 Bh7 11. g4 Qb6 12. g5 cxd4 13. cxd4 Bb4+ 14. Kf1 O-O 15. gxh6 gxh6 16. Ng4 f5 17. Nxh6+ Kh8 18. Bf4 Be7 19. Kg2 Rg8+ 20. Kh3 Raf8 21. Rg1 Rxg1 22. Qxg1 Ndxe5 23. Nxe5 Nxe5 24. Bxe5+ Bf6 25. Rc1 Qd8 26. Qg5 1-0

would a Gemini gem count?

@dlin007 Not planning to add Gemini gems to this market; I’m planning to play the default chat models only.


GPT-4.5 played very well and was better for most of the game. However, it made some strange mistakes in the endgame and lost.

1. b3 e5 2. Bb2 Nc6 3. g3 d5 4. Bg2 Nf6 5. e3 Bd6 6. d4 exd4 7. exd4 O-O 8. Ne2 Re8 9. O-O Bg4 10. Nbc3 Nxd4 11. Qxd4 Bxe2 12. Nxe2 Rxe2 13. Bxd5 Be5 14. Bxf7+ Kxf7 15. Qc4+ Qd5 16. Qxe2 Bxb2 17. Rad1 Qe6 18. Qb5 Qc6 19. Qd3 Re8 20. Rfe1 Rxe1+ 21. Rxe1 Qc3 22. Qe2 Ba3 23. Qe6+ Kg6 24. Re4 h5 25. Rc4 Qa1+ 26. Kg2 Bd6 27. a4 Qe5 28. Qxe5 Bxe5 29. f4 Bd6 30. Kf3 Kf5 31. h3 a5 32. Rd4 b6 33. c4 g6 34. g4+ hxg4+ 35. hxg4+ Ke6 36. Rd1 Nd7 37. Re1+ Kf6 38. Re3 Kf7 39. Ke4 Nf6+ 40. Kf3 Nd7 41. g5 Nc5 42. Kg4 Ne6 43. f5 gxf5+ 44. Kxf5 Ng7+ 45. Kg4 Kg6 46. Rh3 Ne6 47. Rh6+ Kf7 48. Rf6+ Ke7 49. Rf1 Be5 50. Kf5 Bg7 51. Re1 Kd7 52. Rxe6 Bb2 53. g6 Bg7 54. Re2 Kd6 55. Rh2 Ke7 56. Rh7 Kf8 57. Ke6 Kg8 58. Rxg7+ Kxg7 59. Kf5 c6 60. Kg5 c5 61. Kf5 Kg8 62. Kf6 Kf8 63. g7+ Kg8 64. Kg6 b5 65. axb5 a4 66. b6 axb3 67. b7 b2 68. b8=R#

This is the Grok 3 game, where Grok 3 was black. It played the opening well but blundered later in the game. It was forfeited at the end due to the three illegal moves rule.

1. c4 e5 2. g3 Nc6 3. Bg2 d6 4. Nc3 Nf6 5. d3 Be7 6. e4 O-O 7. Nge2 Bg4 8. h3 Bh5 9. g4 Bg6 10. O-O h6 11. Nd5 Nxd5 12. cxd5 Nb8 13. f4 exf4 14. Nxf4 Nd7 15. Nxg6 fxg6 16. Be3 Qe8 17. Qc2 Qf7 18. Rxf7 Rxf7 19. Qxc7 Ne5 20. Qc3 Rb8 21. Rf1 Rbf8 22. Rxf7 Kxf7 23. d4 Nf3+ 24. Bxf3 g5 25. e5 dxe5 26. dxe5 Bd6 27. exd6

Have you tried following this guide on improving LLM chess performance: https://dynomight.net/more-chess/

TL;DR: have the model repeat the whole game so far, and give it three small examples of boards with their legal moves before each turn.

@Bldrt I haven’t tried this format, but I do give it the move, FEN, and PGN in each query. The goal of this market is to explore the performance of LLMs without prompting them with this extra information, similar to a human correspondence/blindfold game.
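Based on the creator's description, each query bundles the move just played, the current FEN, and the full PGN. A rough sketch of how such a query might be assembled (the exact wording of the template is an assumption; the creator has not published it):

```python
def move_prompt(last_move: str, fen: str, pgn: str) -> str:
    """Assemble a per-move query containing the move, FEN, and PGN.

    Hypothetical format: the creator states each query includes these
    three pieces of information, but not the exact phrasing used.
    """
    return (
        f"I played {last_move}.\n"
        f"FEN: {fen}\n"
        f"PGN: {pgn}\n"
        "Your move?"
    )

print(move_prompt(
    "e4",
    "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
    "1. e4",
))
```

This differs from the linked guide mainly in what it omits: no replayed game transcript and no worked examples of legal moves, which is exactly the handicap the market is designed to measure.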


these percents don't make sense

why is a reasoning model less likely to win than 3 non-reasoning models?


Current models are... not close. Illegal moves remain a big problem.
https://www.youtube.com/watch?v=FojyYKU58cw

@AbuElBanat this was published a year ago before the release of o1. In the game that I played against it, o1 played badly but only made one illegal move.

@mr_mino embarrassing oversight. Thanks.

Would a chess specific model count?

@AdamCzene No, I only plan on adding general LLMs. At a minimum the model should also be able to output human languages and code.

FYI latest LLMs are trained on data without chess games because such specific token data degrades performance on other important tasks

@mathvc if this were true, wouldn’t you expect them not to be able to play chess at all? How do you explain o1 playing a full game of chess given only FEN and PGN inputs?

@mathvc They are indeed trained without chess game data. But they are(!) trained with chess notation and rules. So that is how they play without being trained on games.

The fact that they play relatively well, with fewer and fewer illegal moves, is stunning. It's a feat I (as a chess novice) could not repeat. If you locked me in a room with no notepad and only a chat function, I would not be able to play a game of chess with so few illegal moves.

I recently played a game against o1, which I won. o1 made several blunders in this game, I'd estimate its elo to be less than 1000 FIDE. Here is the PGN:

1. e4 e5 2. Nf3 Nc6 3. Bb5 Nf6 4. O-O Nxe4 5. Re1 Nd6 6. Nxe5 Nxe5 7. Rxe5+ Be7 8. d4 Nxb5 9. c4 Nd6 10. c5 Nc4 11. Re2 O-O 12. b3 Na5 13. Nc3 d6 14. Bf4 Bg4 15. Nd5 Bxe2 16. Qxe2 Nc6 17. cxd6 Bxd6 18. Rd1 Re8 19. Bxd6 Rxe2 20. Nxc7 Qxd6 21. Nb5 Rae8 22. Nxd6 Re1+ 23. Rxe1 Rxe1#
