If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Here are some points from my previous comments.
1- To decide whether a given program is an LLM, I'll rely on the media and on the nomenclature the creators give it. If they choose to call it an LLM, or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move, but it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
Update 2025-01-21 (PST) (AI summary of creator comment):
- LLM identification: A program must be recognized by reputable media outlets (e.g., The Verge) as a Large Language Model (LLM) to qualify for this market.
- Self-designation insufficient: Simply labeling a program as an LLM, without external media recognition, does not qualify it as an LLM for resolution purposes.
This may be a tougher challenge than expected. It has been two years since 2023 and no LLM has come even close to that Elo; the only LLM that came semi-close at chess, just by predicting moves based on its dataset, was the mysterious GPT-3.5 Turbo Instruct. If LLMs don't start playing real chess and keeping track of the board state, I will have to sell by late 2026 or 2027, which is concerning since that is only 2 to 3 years away.
@Blocksterpen3 Related: o1 pro lost to me easily (which was only the second game of chess I had played in years). It also repeatedly got confused about the state of the board.
https://chatgpt.com/share/675e2bbb-2e88-8009-8382-b72bd610253c
@DavidBolin Yeah, I hope o3 or even o4 can play a coherent game of chess. Even DeepSeek R1 fails around move 13.
LLMs get better at chess when given three examples of legal moves and their results and asked to repeat the entire previous set of moves before each turn. This can likely be applied to any game.
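For concreteness, here is a minimal sketch of that prompting pattern, assuming python-chess for board tracking; the few-shot examples and the query_llm helper are illustrative placeholders, not anything from this thread.

```python
import chess

# Three illustrative few-shot examples of a legal move and its result
# (made up for this sketch, not taken from any real game log).
FEW_SHOT_EXAMPLES = (
    "Position: starting position. Move: e4. Result: pawn advances e2-e4.\n"
    "Position: after 1.e4 e5. Move: Nf3. Result: knight develops g1-f3, attacking e5.\n"
    "Position: after 1.e4 e5 2.Nf3 Nc6. Move: Bb5. Result: bishop develops f1-b5, eyeing c6.\n"
)

def build_prompt(board: chess.Board, san_history: list) -> str:
    """Build a prompt that restates the full move history and lists the legal moves."""
    legal_moves = ", ".join(board.san(m) for m in board.legal_moves)
    history = " ".join(san_history) if san_history else "(no moves yet)"
    return (
        FEW_SHOT_EXAMPLES
        + "\nGame so far: " + history
        + "\nFirst, repeat the entire list of moves played so far."
        + "\nThen choose exactly one of these legal moves and reply with it alone: "
        + legal_moves
    )

# Usage sketch (query_llm stands in for whatever chat API is being tested):
# board, history = chess.Board(), []
# reply = query_llm(build_prompt(board, history))
```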
@RossTaylor Assuming o3 is also text-only, then yes. The "blind" criterion just means the model doesn't get to see pictures of the board.
@RossTaylor That's what it would have to be doing implicitly for the CoT to be useful. o3's CoT is almost certainly just text.
@AdriaGarrigaAlonso I think that is actually a rather significant component of this question. You could reframe the resolution as "Will a super grandmaster play a serious game of chess against an LLM by 2030?". Even if LLMs continue to improve at chess (they currently aren't any good), this other contingency has to hold as well. Current market seems high.
@dominic I think the more RLHFed the model is, the worse it is at chess. That's probably why 3.5 instruct is better than 4, 4o, and probably o1.
I might be wrong.
It should do better if the output is constrained to PGN format and the model is fine-tuned on Stockfish analysis (available in the Lichess PGN database files); a rough data-prep sketch is below.
There is already a transformer at around 2700 Elo that works just by predicting Stockfish.
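As a rough illustration of the data prep those two points gesture at, here is a sketch that reads a Lichess PGN dump containing [%eval ...] annotations with python-chess and emits (moves-so-far, next move, Stockfish eval) training lines. The file name and the output format are assumptions, not anything specified in the thread.

```python
import chess.pgn

def pgn_to_training_lines(pgn_path):
    """Yield 'moves so far -> next move (eval)' lines from an eval-annotated PGN."""
    with open(pgn_path) as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:
                break  # end of file
            board = game.board()
            san_so_far = []
            for node in game.mainline():
                move_san = board.san(node.move)
                score = node.eval()  # parses [%eval ...] comments; None if absent
                if score is not None:
                    prefix = " ".join(san_so_far) or "(start)"
                    yield f"{prefix} -> {move_san} (eval {score.white()})"
                board.push(node.move)
                san_so_far.append(move_san)

# Usage sketch (the file name is just an example of a Lichess database dump):
# for line in pgn_to_training_lines("lichess_db_standard_rated_2024-01.pgn"):
#     print(line)
```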
@BrandonNorman If you shit in a box and it beats a grandmaster, I for one will respect whatever you call it.
@RiskComplex We already have chess engines that can beat a grandmaster. The bet here is that specifically an LLM will do it.
@BrandonNorman If you can manage to get The Verge to report on your shit in a box that beats a super GM as an LLM, this market resolves to YES.
@MP You're making an argument from authority about an authority I do not respect. The Verge will print stories about whatever makes them the most money, without regard to its truthfulness.
@JS_81 Fine-tune it to what? To recall real chess games very well? Super GMs can do that too, and more besides.