If a large language model beats a super grandmaster (classical Elo rating above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion (say, a game where Hikaru loses to ChatGPT because he played the Bongcloud).
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Some previous comments I made:
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
https://dynomight.net/chess/
> I can only assume that lots of other people are experimenting with recent models, getting terrible results, and then mostly not saying anything. I haven’t seen anyone say explicitly that only gpt-3.5-turbo-instruct is good at chess. No other LLM is remotely close.
>
> To be fair, a year ago, many people did notice that gpt-3.5-turbo-instruct was much better than gpt-3.5-turbo. Many speculated at the time that this is because gpt-3.5-turbo was subject to additional tuning to be good at chatting.
@CaelumForder Then it would have to be a super AGI. It would have to model Stockfish (or something like it), and that is precisely what LLMs do not do, despite all the hope and hype. I'm not saying that it is definitely impossible, but all the bizarre failures we see in LLMs today are due to them extrapolating outside their training data. Even if LLMs could beat Stockfish, they will never do it by predicting Stockfish. To do that they would have to actually emulate Stockfish, and that would necessarily be much less time-efficient than Stockfish itself, so they would not be able to search as deeply; moves in GM chess games are time-limited. If an LLM were ever to beat a computer program at chess, it would be more like AlphaGo, but even that would require something that doesn't exist in current LLMs. Of course, beating a GM is easier than beating Stockfish. I think the main problem here is the misunderstanding of the sense in which LLMs "predict" the next token.
@YonatanCale No, the LLM is not allowed to use an external database, run external code, or write and then train a chess engine. It must rely solely on the knowledge and capabilities encoded in its pretrained weights, without any additional data sources or code execution support beyond this. This restriction is meant to ensure that the model operates purely as a general-purpose language model and not as a specialized chess-playing system.
The idea here is to see if a general LLM, with no chess-specific training or external computational assistance, can reason and play well enough to beat a super grandmaster in blind chess by 2028.
Two questions for @MP
"while playing blind chess by 2028" What is meant by 'blind' here? Neither the GM nor the AI gets a visual of the board?
If the AI just does MCTS (in token space) for as long as it likes in context, this is fine, correct?
@JacobPfau
"What is meant by 'blind' here?"
The term "blind" means that both the super grandmaster (GM) and the AI will not have a visual of the board. Instead, they would receive and make moves through notation alone, requiring them to track the board state mentally.
"If the AI just does MCTS (Monte Carlo Tree Search) in token space for as long as it likes in context, this is fine, correct?"
Yes, this would generally be acceptable as long as the MCTS operates purely within the model's token space and doesn't rely on any external computational aids or access to specific chess engines or databases. The LLM can perform internal reasoning (such as simulating moves in token space) for as many tokens as its architecture allows, but this reasoning must be self-contained within the model's weights and computational constraints.
By 2028 I suspect all major LLMs will be so fully integrated with browsers/software packages, and so proprietary, that we'll have no clue what the answer to this question is. Yes, the LLM beats humans at chess, but is it possibly, or even likely, just calling out to Stockfish under the hood? I absolutely don't think anything that's just the language model part will be able to beat humans at chess, but I also don't think that will be as easy to test in isolation as it is today.
@DavidSpies You are right, but I expect that the LLM developers will want to try this sort of thing, as it would be an interesting result. The problem is getting a GM to do it. Obviously you would train it against Stockfish, but I don't think that would tell you much: a GM would deliberately make uncommon moves, as that has been their best approach in the past, and that would be expected to be more successful against an LLM than against something like Stockfish, which works by evaluating possible positions rather than predicting the next move from patterns.
To clarify, since the market seems to disagree with me on this: This is irrelevant because it is a chess engine, not an LLM (surely I needn't remind you that chess engines have been beating humans for decades), and it is unimpressive because their turning the dial all the way towards one end of the evaluation accuracy/search depth trade-off resulted, predictably, in a kinda shitty chess engine (~1000 Elo worse than what one could get by spending comparable computational resources on search).
@sbares It's not totally irrelevant, because it shows that the transformer architecture can easily get to super-grandmaster levels. This means if someone really wanted to they could probably custom-train an LLM to be good at chess. As I've said below I don't think this possibility was ever really in doubt, to the point that I think it would be too boring for anyone to actually bother. Maybe some traders really thought that transformers could not make good chess evaluations, in which case this paper would be news to them.
@sbares It's a decoder-only transformer in a training setup that is extremely close to LLM pretraining (modulo tokenization and context window considerations). Imagine a long document composed of a bunch of <FEN> <Stockfish evaluation> <Move> <Resulting evaluation> quads, and minimizing autoregressive log loss (exactly the next token prediction setup in LLM pretraining) on that document--that is substantially equivalent to their training setup. They get grandmaster performance with a 270M (!) parameter model trained with on the order of 10B tokens.
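A minimal sketch of the "long document of quads" idea described above, to make the next-token-prediction framing concrete. Everything here is an assumption for illustration: the `Quad` class, the whitespace tokenization, and the evaluation numbers are placeholders, not the paper's actual data format or real Stockfish output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Quad:
    fen: str            # position before the move, in FEN notation
    eval_before: float  # placeholder evaluation of that position (made-up value)
    move: str           # move played, in UCI notation
    eval_after: float   # placeholder evaluation after the move (made-up value)


def serialize(quad: Quad) -> str:
    """Flatten one quad into a single line of text."""
    return f"{quad.fen} {quad.eval_before:.2f} {quad.move} {quad.eval_after:.2f}"


def build_document(quads: List[Quad]) -> str:
    """Concatenate quads into one long training document, one quad per line."""
    return "\n".join(serialize(q) for q in quads)


def next_token_pairs(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with the token that follows it.

    An autoregressive model is trained to predict the second element
    of each pair given everything up to and including the first.
    """
    return list(zip(tokens[:-1], tokens[1:]))


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
doc = build_document([Quad(start, 0.52, "e2e4", 0.53)])
tokens = doc.split()  # naive whitespace tokenizer, purely illustrative
pairs = next_token_pairs(tokens)
```

A real setup would use a proper tokenizer and minimize log loss over these targets; the point is only that the chess data reduces to ordinary text for next-token prediction.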
The upshot is that if you mix several billion tokens of Stockfish-annotated chess data into the pretraining corpus of an LLM training run, at the scale of models being trained nowadays, they should have more than enough capacity to turn out to be really strong at chess. Model developers will be using several hundred trillion tokens for pretraining, a lot of it synthetic, by 2026, so it's extremely plausible that 0.01% of the corpus would be chess. The resulting system would likely be a strong chess move evaluator on a single forward pass.
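As a rough check of the arithmetic in that estimate, here is a small sketch. The 300 trillion figure is my own assumption standing in for "several hundred trillion", and 10 billion stands in for "on the order of 10B tokens"; neither is a reported number.

```python
from fractions import Fraction

# Assumption: "several hundred trillion tokens" taken as 300 trillion.
corpus_tokens = 300 * 10**12

# 0.01% of the corpus, kept as an exact fraction to avoid float rounding.
chess_fraction = Fraction(1, 10_000)

chess_tokens = int(corpus_tokens * chess_fraction)

# Assumption: the chess model's training set taken as 10 billion tokens.
paper_tokens = 10 * 10**9

ratio = chess_tokens // paper_tokens

print(f"chess tokens in corpus: {chess_tokens:,}")
print(f"multiple of the ~10B-token training set: {ratio}x")
```

Under these assumptions, a 0.01% chess share would be 30 billion tokens, a few times the amount the 270M-parameter model was trained on.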
Further, that setup is a strong lower bound on how difficult it would be to get good chess performance from an LLM, since it assumes no CoT or effective use of inference-time compute, which is (a) finally being demonstrated to be effective, and (b) plausibly most useful for search-heavy problems like chess. o1 in particular is a pretty striking demonstration that a lot of our assumptions about model weaknesses in reasoning came from pretraining alone doing a poor job of allowing models to generate long, coherent chains of thought. If it seems non-economical to you that standard LLMs will get to grandmaster level with a single forward pass, keep in mind that that is only a sufficient condition for this market to resolve Yes.
@IsaacCarruthers, @AdamK You would both have a point, if the engine were actually any good. As it stands, the claim of "grandmaster-level" play is... very generous at best. In fact, one could see this as probing the location of the barrier against search-less solutions to this particular search problem, in which case one should adjust down if anything, as this barrier seems to be lower than (at least I) expected.
@GabrielTellez Normally the rules are that if an illegal move is made, a time penalty is applied, and if it happens 3 times in a game you lose the game. I would assume that the same rules would apply here.
https://www.fide.com/FIDE/handbook/LawsOfChess.pdf
(Article 7.4)
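The rule as described in the comment above could be tracked with something like the following sketch. The class name and the penalty size are hypothetical placeholders; the comment does not state the penalty amount, and the exact wording of the FIDE article should be checked against the handbook.

```python
from dataclasses import dataclass


@dataclass
class IllegalMoveTracker:
    max_illegal: int = 3        # per the comment: the third illegal move loses the game
    penalty_seconds: int = 120  # hypothetical placeholder, not taken from the comment
    count: int = 0
    penalty_total: int = 0

    def record(self) -> str:
        """Register one illegal move and return the consequence."""
        self.count += 1
        if self.count >= self.max_illegal:
            return "loss"
        self.penalty_total += self.penalty_seconds
        return "penalty"


tracker = IllegalMoveTracker()
outcomes = [tracker.record() for _ in range(3)]
```

With the defaults, the first two illegal moves each add a time penalty and the third ends the game.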