
If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion (say, a game where Hikaru loses to ChatGPT because he played the Bongcloud).
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without having been created specifically to do so (just as humans aren't chess-playing machines). Some points from my previous comments:
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
If it can regurgitate chess engine code from its training set and then run that code, OK, but that doesn't really count as an LLM playing chess.
To confirm re: “while playing blind chess”
for the match to qualify, the grandmaster must agree to a https://en.m.wikipedia.org/wiki/Blindfold_chess match, correct?
@MP This is a great way to handle the imbalance of language models not being able to "see" the board. So kudos on that.
Since GPT-4 will soon be released with image support, I also have a market on whether it will be able to beat my roommate (who someone estimated at 600 Elo) when given renderings of the chess board as input. It lost the text-only game: shortly after being taken out of the standard opening, it blundered by not seeing a threat.
GPT-5 is also likely to be multimodal.
For chess-specific capabilities, I see there are some chess benchmarks in the OpenAI evals repo, so maybe a little bit of training on puzzles or similar will be enough to get a better model of the game. Open-source models will be fine-tuned on chess, but those won't count. It's possible OpenAI's evals are used to train other models, and the chess capabilities will carry over.
So, considering the grandmaster will be blindfolded, and chess-specific training is getting into the latest GPTs (but not enough to make them chess-specific models), I'm willing to buy this up somewhat.
The other risk besides capability is if nobody bothers to play a blind chess game. But it looks like it's at least somewhat popular, and all it takes is one game. So as long as the model performs somewhat okay against good players, someone will try it.
@Mira https://youtu.be/W6jkLKo8To0
The idea for the market came from this YouTuber, who is a national master.

My uncertainty in this market largely hinges on the definition of an LLM. I don't think it's theoretically impossible for an LLM to beat a grandmaster at chess, but I think the scale required would be absurd, and model architectures are likely to change by 2028 anyway, so the likelihood that anyone trains a current LLM at the scale that would be required is minimal. However, exactly how model architectures change, and whether those changed models are still widely referred to as LLMs, is what I'm unsure of.
@Weepinbell If LLMs can't beat Caruana at chess, then it's very unlikely LLMs will iterate until they are AGIs.

I'm parroting Yann LeCun a bit here with some editorialization... unless the widely accepted definition of LLMs changes significantly, the understanding of the world that today's GPT-based LLMs have is limited to their understanding of language. Their understanding of physical reality and logic is an illusion. Contrast this with chess engines and people, whose understanding of the rules of chess comes from actual learned experience of the game itself.
Further, I attempted to "play chess" with different language models / ChatGPT about a month ago, and it doesn't even get the algebraic notation (AN) correct yet. No one seems to have paid much attention to chess AN, e.g. perhaps there wasn't sufficient AN in the training set, because the mistakes it makes are far worse than, say, when it generates Python.
So if people aren't using LLMs to play chess and submitting reinforcement, there's not much chance for their grasp of AN to improve right out of the gate. There's not a huge commercial application.
Once it does get AN figured out, there's going to need to be a sincere effort to train an LLM specifically on chess games and logic.
So when the market says, "no plugins allowed," ... I read that as, "no filtering allowed / no ensemble models with a tree structure allowed." What about vector embeddings? What about feed forward algorithms? This is why I say, "the definition of an LLM can't change significantly." I think the definition needs to be accepted as, "an LLM in the same form, with more or different training data, more parameters."
Other markets could be put together for solving chess with other technologies.
@PatrickDelaney I am ruling out models that are specialized in chess, or that have capabilities specially tailor-made for chess.

@MP You are excluding changing the architecture / design / training procedure of the model correct? What about just fine-tuning the model on chess trajectories? Like giving a language pre-trained model a bunch of chess games and having it learn the games in algebraic notation as it does language (i.e. imitation learning)?
@MP Will wins in blitz (or other limited-time formats) be counted? Does anyone know how much worse (in terms of base game Elo) players are under blitz time constraints?
@firstuserhere okay description answered this.
" it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database."

Large language models by their very nature are not capable of dealing with complex state like chess.
Language models are good for text generation, but not for problem solving.
It seems many people don't understand the difference between neural models and LLMs. A neural model beating a 2700 player by 2028? Easy. An LLM beating a 2700 player by 2028? Good luck not making illegal moves, let alone beating a 2700 player.
@DmytroBulatov LLMs are very much like human beings. They can reason and have intuition. Believe it or not, Carlsen isn't a machine.
@MP No, they aren't. That's the thing: LLMs are not general-purpose artificial intelligence. They have a specific task to solve, which is generating text. They are not trained to validate information and be correct; they are trained to generate text like humans do.
Current models can't even reliably calculate where chess pieces will be on the board after specified moves. The reason is that they are not trained to remember or calculate anything.
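To make "calculating where the pieces will be" concrete, here is a minimal sketch of that bookkeeping. It is only an illustration under loud assumptions: the function names are made up, moves are given in plain coordinate form (e.g. "e2e4") rather than AN, and there are no legality checks and no castling, en passant, or promotion.

```python
# Minimal sketch: replay coordinate moves (e.g. "e2e4") on a dict-based board.
# Assumptions (not from the thread): no legality checks, no castling,
# no en passant, no promotion. Names here are hypothetical.

FILES = "abcdefgh"
BACK_RANK = "RNBQKBNR"


def starting_position():
    """Return the standard starting position as {square: piece}."""
    board = {}
    for i, f in enumerate(FILES):
        board[f + "1"] = BACK_RANK[i]          # white pieces are uppercase
        board[f + "2"] = "P"
        board[f + "7"] = "p"
        board[f + "8"] = BACK_RANK[i].lower()  # black pieces are lowercase
    return board


def apply_moves(board, moves):
    """Return a new board after moving pieces from source to target squares."""
    board = dict(board)  # copy so the caller's board is not mutated
    for move in moves:
        src, dst = move[:2], move[2:4]
        if src not in board:
            raise ValueError(f"no piece on {src} for move {move}")
        board[dst] = board.pop(src)  # a capture simply overwrites the target
    return board


position = apply_moves(starting_position(), ["e2e4", "e7e5", "g1f3", "b8c6"])
print(position["e4"], position["e5"], position["f3"], position["c6"])
```

An ordinary program does this exactly, every time; the question in this thread is whether a model can do the equivalent implicitly from the move text alone.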

@DmytroBulatov I am betting NO, but you are incorrect regarding the state of GPT-4's chess ability. It extremely rarely makes illegal moves, and mostly makes pretty good moves, even in board states that have never existed in history. I estimate it's around 1200 Elo.
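For a sense of scale, the standard Elo expected-score formula can be applied to that 1200 estimate against a 2700 player. This is a rough sketch only: the function name is mine, the 1200 figure is just the estimate in this comment, expected score counts a draw as half a point, and the formula is not well calibrated for rating gaps this large.

```python
# Sketch: Elo expected score, E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400)).
# Assumption: the 1200 rating is only the rough estimate given in the comment.

def expected_score(rating_a, rating_b):
    """Expected score (win = 1, draw = 0.5) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


llm_elo = 1200  # estimate from the comment, not a measured rating
gm_elo = 2700   # the super grandmaster threshold in the market description

score = expected_score(llm_elo, gm_elo)
print(f"expected score per game: {score:.6f}")
```

Under these assumptions the expected score is on the order of a couple of points per ten thousand games, which is why I am still betting NO.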

@SemioticRivalry How are you playing it? This has not been my experience with GPT
@DmytroBulatov Well, if X can model Y and Y can model Z, can X model Z?
X can be an LLM, Y can be natural language, and Z can be the physical world.
If Y can model Z, aka language can model the world, and X can model Y, aka a system can model language, does the system (in theoretical bounds) also become capable of modeling the physical world?
@firstuserhere Remember that language captures the true realities of the physical world far more often than fabricated realities of the world. Language is capable of modeling both real and false/imaginative worlds. The thing to know is that humans are quite good at separating the two, while LLMs currently are not, and demonstrate "hallucination".
However, fundamentally, what are neural networks capable of? Separating signal from noise in the data when both the signal and the noise look similar to us.
It is not an imaginative leap to see systems capable of "grounding" themselves and separating real-world descriptions from imaginative-world descriptions.
@SemioticRivalry That just proves my point. It can't correctly track the state of the game. It makes illegal moves, even if GPT-4 makes fewer of them than GPT-3. As the game gets longer, it will inevitably, if not make an illegal move, then at least "misremember" the state of some pieces in the game.
It all comes down to how it is trained. It isn't trained to become good at chess; it is trained to generate text. Humans can't get good at chess by only reading (sometimes incorrect) text online and practicing writing text. If it doesn't actually get trained specifically on chess games, then I don't see a feasible way for an LLM to improve to a level far above the average player.
My point is: LLMs are good and all, and neural networks can definitely get to 2700 Elo, but LLMs are just not the tool for the job here.
@DmytroBulatov What about fine-tuning on chess notation and books? Online games mostly exist as entire games in text format, and grandmasters also learn moves abstractly, not only by physically playing games but by simulating them.

@Mira If a human child required all the helping hands you are giving GPT, I would say that child doesn't know how to play chess.

@DmytroBulatov It's a probabilistic model- it will never be 100% correct at anything, but it has a very high success rate at chess. I've had it give >500 moves and have got very few (<5%) illegal moves.
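A rough way to see how a per-move illegal rate compounds over a whole game, assuming every move is independent (my simplifying assumption; in practice errors probably cluster in long or unusual positions). The 5% figure is only the upper bound quoted above, 40 moves is an illustrative game length, and the function name is made up.

```python
# Sketch: chance of finishing a game with no illegal move at all,
# assuming each move is illegal independently with the same probability.

def prob_all_legal(per_move_illegal_rate, moves_by_model):
    """Probability that every one of the model's moves is legal."""
    return (1.0 - per_move_illegal_rate) ** moves_by_model


upper_bound_rate = 0.05  # "<5%" from the comment, used as a worst case
game_length = 40         # illustrative number of moves by the model

clean_game = prob_all_legal(upper_bound_rate, game_length)
print(f"chance of a fully legal {game_length}-move game: {clean_game:.3f}")
```

So even a small per-move rate matters over a long game, though a model that is allowed to retry after an illegal move is in a much better position than this number suggests.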
@ForrestTaylor I simply prompt it with a game and tell it to complete. Sometimes it actually finishes it, sometimes it gives like 20 moves, but it's extremely rare that it makes illogical or impossible moves. Here's my first attempt:

The first move is a big mistake, but it's a very human one: failing to see the potential pin from a bishop. Then it takes advantage of its own mistake by pinning the queen and winning it. Most of these moves are very good and a lot are even the perfect Stockfish move. Every single one of these moves is not only possible but makes sense in the context of the game, although there are certainly a few mistakes.

@MP "LLMs are very much like human beings." Please check out my markets. They are not very much like human beings, they approximate knowledge. We can measure how frequently these approximations surpass human capability in different areas, but this does not mean they have a real understanding of the world. They are indexing language to describe the world...big difference.
@PatrickDelaney I also don't have a real understanding of the world; I also approximate knowledge. Read the Sequences.
To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
Note the constraint “while playing blind chess”. Most games won’t qualify since grandmasters don’t play that much blind chess.

“They don’t know that Stockfish is a literal neural network—and has an 800 Elo advantage”
(This could be distilled into a “language model” that would run on a 5yo cell phone and still crush GMs)

AI's chess prowess on the rise,
Stockfish taunts with every surprise,
Pity the grandmaster, in demise,
Large language models win the prize.
What does "large language model" mean? Hypothetically, if I found a Markov chain that produces winning chess moves with high probability, would that count even if it isn't "large" and doesn't resemble GPT-3? Does it have to be transformer-based?
If I train a chess engine using transformers, and at every step it emits the next steps of a tree search algorithm, and it can be iterated much like ChatGPT can be stepped with its finite context to simulate programs, does that count as a "language model"?
Does the language model have to be published by a company and marketed for a purpose other than games? Does it require a minimum capital investment (a possible definition of "large")? A minimum number of parameters? What if the chess capabilities work with a small number of parameters, but somebody grafts some useless ones on just to satisfy the requirement of being large?
