If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion (say, a game where Hikaru loses to ChatGPT because he played the Bongcloud).
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Some previous comments I made:
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Alternatively, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move, but it can't have external help beyond what is already in its weights. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
Two questions for @MP
"while playing blind chess by 2028" What is meant by 'blind' here? Do neither the GM nor the AI get a visual of the board?
If the AI just does MCTS (in token space) for as long as it likes in context, this is fine, correct?
By 2028 I suspect all major LLMs will be integrated enough with browsers/software packages, and proprietary enough, that we'll have no clue what the answer to this question is. Yes, the LLM beats humans at chess, but is it possibly, even likely, just calling out to Stockfish under the hood? I absolutely don't think anything that's just the language model part will be able to beat humans at chess, but I also don't think that will be as easy to test in isolation as it is today.
To clarify, since the market seems to disagree with me on this: This is irrelevant because it is a chess engine, not an LLM (surely I needn't remind you that chess engines have been beating humans for decades), and it is unimpressive because turning the dial all the way towards one end of the evaluation accuracy/search depth trade-off resulted, predictably, in a kinda shitty chess engine (~1000 Elo worse than what one could get by spending comparable computational resources on search).
@sbares It's not totally irrelevant, because it shows that the transformer architecture can easily get to super-grandmaster levels. This means if someone really wanted to they could probably custom-train an LLM to be good at chess. As I've said below I don't think this possibility was ever really in doubt, to the point that I think it would be too boring for anyone to actually bother. Maybe some traders really thought that transformers could not make good chess evaluations, in which case this paper would be news to them.
@sbares It's a decoder-only transformer in a training setup that is extremely close to LLM pretraining (modulo tokenization and context window considerations). Imagine a long document composed of a bunch of <FEN> <Stockfish evaluation> <Move> <Resulting evaluation> quads, and minimizing autoregressive log loss (exactly the next token prediction setup in LLM pretraining) on that document--that is substantially equivalent to their training setup. They get grandmaster performance with a 270M (!) parameter model trained with on the order of 10B tokens.
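To make the "long document of quads" picture concrete, here is a minimal sketch of how such a document could be flattened into next-token prediction examples. Everything in it is an assumption for illustration: the helper names `build_document` and `next_token_pairs` are hypothetical, the evaluation numbers are made-up placeholders rather than real Stockfish output, and whitespace splitting stands in for whatever tokenizer the actual setup uses.

```python
from typing import List, Tuple

# (FEN, evaluation, move, resulting evaluation); the evaluations are placeholders.
Quad = Tuple[str, str, str, str]

def build_document(quads: List[Quad]) -> str:
    """Flatten quads into one training document, one quad per line."""
    return "\n".join(" ".join(quad) for quad in quads)

def next_token_pairs(document: str) -> List[Tuple[List[str], str]]:
    """Return (context, target) pairs, the shape of autoregressive training data.

    Whitespace splitting is a stand-in for a real tokenizer.
    """
    tokens = document.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

quads: List[Quad] = [
    ("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", "0.52", "e2e4", "0.53"),
    ("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "0.47", "e7e5", "0.48"),
]

pairs = next_token_pairs(build_document(quads))
context, target = pairs[-1]
print(len(pairs), "training pairs; last target token:", target)
```

A model trained to minimize log loss on the `target` given each `context` is doing ordinary next-token prediction; the only chess-specific part is what the document contains.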
The upshot is that if you mix several billion tokens of Stockfish-annotated chess data into the pretraining corpus of an LLM training run, at the scale of models being trained nowadays, they should have more than enough capacity to turn out to be really strong at chess. Model developers will be using several hundred trillion tokens for pretraining, a lot of it synthetic, by 2026, so it's extremely plausible that 0.01% of the corpus would be chess. The resulting system would likely be a strong chess move evaluator on a single forward pass.
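As a rough check on the arithmetic, the sketch below works out what 0.01% of a pretraining corpus amounts to. The 300-trillion figure is only an assumed stand-in for "several hundred trillion"; the 10B comparison point is the "on the order of 10B tokens" quoted earlier in the thread.

```python
# Assumed stand-in for "several hundred trillion tokens".
corpus_tokens = 300 * 10**12

# 0.01% is one part in 10,000.
chess_tokens = corpus_tokens // 10_000

# "On the order of 10B tokens" used to train the 270M-parameter model.
paper_tokens = 10 * 10**9

print(f"chess tokens at 0.01% of corpus: {chess_tokens:,}")
print(f"multiple of the 10B-token budget: {chess_tokens / paper_tokens:.1f}x")
```

Under that assumption the chess slice comes to 30 billion tokens, about three times the 10B-token budget, so the share being discussed is small relative to the corpus but not small relative to what the chess-only model needed.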
Further, that setup is a strong lower bound on how difficult it would be to get good chess performance from an LLM, since it assumes no CoT or effective use of inference-time compute, which is (a) finally being demonstrated to be effective, and (b) plausibly most useful for search-heavy problems like chess. o1 in particular is a pretty striking demonstration that a lot of our assumptions about model weaknesses in reasoning came from pretraining alone doing a poor job of allowing models to generate long, coherent chains of thought. If it seems non-economical to you that standard LLMs will get to grandmaster level with a single forward pass, keep in mind that that is only a sufficient condition for this market to resolve Yes.
@IsaacCarruthers, @AdamK You would both have a point, if the engine were actually any good. As it stands, the claim of "grandmaster-level" play is... very generous at best. In fact, one could see this as probing the location of the barrier against search-less solutions to this particular search problem, in which case one should adjust down if anything, as this barrier seems to be lower than (at least I) expected.
@GabrielTellez normally the rules are that if an illegal move is made, a time penalty is applied, and if it happens 3 times in a game you lose the game. I would assume that the same rules would apply here.
https://www.fide.com/FIDE/handbook/LawsOfChess.pdf
(Article 7.4)
So I see two main ways this could happen (please chime in if you think I'm missing some):
1. The AGI route, where we see such a huge leap forward in reasoning abilities coming out of LLMs that they are able to talk themselves into grandmaster-level reasoning. I'd put this at around 1%.
2. Someone takes a regular LLM and just includes a bunch of chess games in its training data, specifically in order to create an LLM that can play decent chess. I think it would be easy enough to get a ~2000 ELO LLM this way, and probably with some effort you could get one significantly stronger. The reason I think this isn't super likely to happen (~30%), is that it just wouldn't be that interesting. "Oh, you made an LLM that's also an ML model trained to be decent at chess? Cool? I guess?"
I'm also assuming that if someone makes some hybrid LLM where the language portion recruits a separate logic engine for analytical tasks, this would count as "LLM writes code to build a chess engine, and then uses the chess engine" rather than "LLM plays chess", but I'd put this route at 5-10% so I still think this market is high either way.
I would say:
3. Prompt engineering gets better, such that the LLM isn't closer to AGI but is able to talk itself into GM-level thinking when given explicit steps on how to do so.
4. A large number of games are played. The LLM doesn't have to win consistently, just once; if 10,000 games are played between grandmasters and LLMs, the LLM is pretty much guaranteed to win at least once.
5. A combination of weak versions of all/some of these. I don't expect LLMs to reach AGI level by 2028, nor do I expect prompt engineering or more training data to make current LLMs GM level, nor do I expect 10,000 games to be played. But if LLM reasoning gets twice as good, prompt engineering gets 20% better, someone includes more games in the training data, and 10 games are played, I think there are pretty good odds the LLM will win at least once, and that feels more likely to me.
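The "large number of games" point can be made concrete with a small calculation. This is a sketch under loud assumptions: games are treated as independent with a fixed per-game win probability, the 0.1% win probability is a hypothetical number chosen for illustration, and the standard Elo expected-score formula is used only as an upper bound on win probability (expected score counts a draw as half a point, so it is at least the chance of winning). The 1800 and 2700 ratings are the figures mentioned elsewhere in this thread.

```python
def elo_expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score (win = 1, draw = 0.5) for `rating` vs `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))

def p_at_least_one_win(p_win: float, games: int) -> float:
    """Chance of at least one win in `games` independent games."""
    return 1.0 - (1.0 - p_win) ** games

# Upper bound on the per-game win chance for an 1800-rated player against 2700.
bound = elo_expected_score(1800, 2700)
print(f"expected score, 1800 vs 2700: {bound:.4f}")

# Hypothetical per-game win probability of 0.1%.
for games in (10, 10_000):
    print(f"{games:>6} games: P(at least one win) = {p_at_least_one_win(0.001, games):.4f}")
```

With a 0.1% per-game chance, 10,000 games make at least one win close to certain, while 10 games leave it at about 1%, which is why the number of games actually played matters so much here.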
3 seems basically impossible to me: if the smartest humans alive could not talk themselves into being chess GMs (which I'm pretty sure they can't, at least without also playing thousands of games) then we're not going to see an LLM do it any time soon.
4 seems most likely to come into play as a component of 2, because why would GMs be spending their time playing thousands of games against an LLM unless that LLM was specifically marketed as being good at chess?
I think the most likely path to 2 is something like "OpenAI develops a self-teaching procedure, and has GPT-Next teach itself chess from books and self-play to prove a point." Once we see how much real novelty comes out in the next generation of LLMs I think we'll have a much clearer picture of where things are headed.
@IsaacCarruthers A group of smart humans can write code to implement and train AlphaZero. Given enough time and scratch space, they could also simulate it by hand a la xkcd.com/505/.
So given unbounded runtime/scratch space and clever prompting, the LLM doesn't need to be any good at chess, just good at writing code. And it seems much more likely that someone will spend a few billion to train a specialist software-dev LLM vs. a specialist chessplayer LLM.
@placebo_username yes in my top level I mentioned that I was assuming this would count as "LLM builds and then uses a chess engine" rather than "LLM plays chess"
@IsaacCarruthers Not quite. My point is that the logic engine could be implemented by the LLM itself within the language portion instead of being a separate subsystem accessed via queries.
I strongly doubt today's LLMs could beat a super grandmaster even 1% of the time in hyper bullet. I just timed how long it would take to generate some responses from the GPT-4o API. Here is the transcript:
System: "you are a chess super grand master. you will be provided a chess move and you will say what you think the best follow up is. provide no reasoning or preamble, but only the move."
Me: "e4"
LLM: "e5" (took 2.49 seconds)
Me: "Nf3"
LLM: "Nc6" (took 2.14 seconds)
Me: "Bb5"
LLM: "a6" (took 2.18 seconds)
Me: "Ba4"
LLM: "Nf6" (took 2.47 seconds)
Me: "Nc3"
LLM: "b5" (took 3.66 seconds)
Me: "h3"
LLM: "Be7" (took 2.39 seconds; if it were a 15-second game with no increment, the LLM would have flagged here)
And this is without having it explain its reasoning or giving it the current board state instead of a list of moves. I did that so the inputs and responses would be short and it could run quickly, with the trade-off that it will play much worse than it would if it took the time to "see" the whole board and show some reasoning. Unless the problem is that my internet is super slow, that I messed up my API-calling code, or that I should be using a faster but less accurate model, then no, there is no chance a super GM could ever lose a game against an under-1800 Elo opponent who times out on the 6th move. It's just not gonna happen.
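Summing the six response times from the transcript above shows where the clock would run out. This is a minimal sketch, assuming a 15-second clock with no increment and that the full API latency counts against the LLM's time; `flag_move` is a hypothetical helper name.

```python
from itertools import accumulate
from typing import List, Optional

def flag_move(latencies: List[float], clock_seconds: float) -> Optional[int]:
    """Return the 1-based move on which total time used exceeds the clock, else None."""
    for move_number, elapsed in enumerate(accumulate(latencies), start=1):
        if elapsed > clock_seconds:
            return move_number
    return None

# Response times in seconds, copied from the transcript.
latencies = [2.49, 2.14, 2.18, 2.47, 3.66, 2.39]

print(f"total time used: {sum(latencies):.2f} s")
print("flags on move:", flag_move(latencies, 15.0))
```

The total comes to a little over 15 seconds, so the clock expires on the sixth reply; at roughly 2.5 seconds per move, a longer clock would only postpone the same outcome by a few moves.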
I assume you’d use a smaller model for this. Something like LLaMa 8B can get responses significantly faster, right? Fine-tune it on chess data and it could probably get you to 1800 Elo.
(That said I think it’s kind of a cheap way to resolve the market. Computers are obviously faster than humans, and I bet I could make a bot that could beat any human in “extreme hyper bullet chess” with 1 second time for each side. IMO it should be required to be at least 10 minutes per side or something)
@AdamK limit order at 65% for 10k NO shares if you want to go get it.