Will a large language model beat a super grandmaster playing chess by 2028?
Basic
1.2k
496k
2029
55%
chance

If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.

I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)

Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Some clarifications from my previous comments:

1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature the creators give to it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.


2- The model can write as much as it wants to reason about the best move, but it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.

I won't bet on this market, and I will refund anyone who feels betrayed by this new description and had open bets as of 28th Mar 2023. This market will require judgement.

bought Ṁ200 YES

I've updated more strongly towards Yes here given the recent AlphaProof result.

https://www.scientificamerican.com/article/ai-will-become-mathematicians-co-pilot/

I think the major step up wasn't just in AI capabilities, but in the underlying mathematics libraries that allow computers to interact with math and test out proofs.

I think that proving frameworks essentially bring math to the same level as chess, i.e. both can now be "checked".

The other part is that an LLM needs to be able to "explore" the solution space effectively. The fact that an LLM can do this for math makes me more inclined to believe it can do it for chess.

But the LLM can't use another AI to "check" its answer, because the checker would be just a regular chess engine, and we know those beat GMs. It'd have to do the exploration and the checking itself. I think an LLM can do a decent job exploring; I don't think they'll be able to really check, at least not by 2028.

And part of what motivates my position here is that if they can, then we've reached the singularity; it's AGI that can do anything, so mana will be completely pointless.

The "check" step happens during reinforcement learning. The LLM then learns to check on it's own.

I'm imagining something like the 2nd figure in this graphic: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

I think this is about as far from singularity as getting IMO gold is. I dunno.

>The LLM then learns to check on it's own.

My understanding is that it does not: it uses the programming language Lean to write out formal proofs, and when the Lean code is compiled, the compiler tells you whether the proof is valid or not. So the LLM spits out millions of candidate solutions and hopes one of them is correct. That's still very impressive, but not nearly as impressive as an AI that could perform in the IMO without using Lean or another formal proof language.
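
To make the "checker" concrete, here is a minimal sketch (mine, not AlphaProof's actual setup): a Lean file only compiles if the proof term type-checks against the stated theorem, so a model can propose many candidates and keep whichever ones the compiler accepts.

```lean
-- Minimal illustration of mechanical checking (a sketch, not AlphaProof's actual setup).
-- Lean accepts this file only because the proof term type-checks against the statement.
theorem candidate_ok (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A wrong candidate is rejected at compile time, e.g.:
-- theorem candidate_bad (a b : Nat) : a + b = b + a := rfl  -- fails to type-check
```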

I feel like people are forgetting that AI already beats humans in chess. This isn't about whether AI can beat humans in chess; it's about whether an LLM can beat a human in chess without outside help.

opened a Ṁ1,000 NO at 60% order

@AdamK limit order at 60% for 2.5k NO shares if you want to go get it

@MP sorry if this is already clear from pre-existing comments or the description, but to clarify/confirm: would an otherwise general-intelligence model fine-tuned for chess count?

This market is heavily influenced by whether we use the term "LLM" to describe future search-based transformer architectures. An architecture that doesn't utilize search seems unlikely to reach grandmaster strength.

I am 1800 on lichess. I tried a couple of times to spell out every feature of the board that affects my decision (not using intuition, but pure reasoning: describing the current strategic function of each piece, brute-forcing with words). The sheet of text for each move (with my 3-move intuition-guided depth-first search) would become bigger than the volume of text over which ChatGPT-4 is able to stay consistent (I use it daily, and I see how some sentences outweigh older ones, but all branches should be considered equally until they are definitively shown to be bad).

The problem is that most books on chess use notation, and instead of discussing every intermediate move they just provide a sequence of "obvious" moves, and only for "promising" branches. (Obvious to humans, who have geometric visualisation and spatial pattern recognition.)

I think it would be possible to speed up progress in this field if somebody first created "a Wordify": a program which uses a mix of an engine and an LLM to create example literature, which could then be used as training data for a pure LLM.
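
As a rough sketch of what such a "Wordify" generator could look like (assuming python-chess is installed and a Stockfish binary is on PATH, both my assumptions): it spells out every piece in words and attaches an engine evaluation to every legal move, so no branch is silently skipped.

```python
# Hedged sketch of the "Wordify" idea: turn a position into prose plus an engine
# evaluation of *every* legal move, intended as example training text for an LLM.
# Assumes python-chess is installed and a Stockfish binary is on PATH.
import chess
import chess.engine

def wordify(board: chess.Board, engine: chess.engine.SimpleEngine, depth: int = 12) -> str:
    pieces = [f"{'white' if p.color else 'black'} {chess.piece_name(p.piece_type)} "
              f"on {chess.square_name(sq)}"
              for sq, p in board.piece_map().items()]
    text = ["Pieces: " + "; ".join(pieces),
            f"{'White' if board.turn else 'Black'} to move."]
    for move in board.legal_moves:                      # every branch, not just "obvious" ones
        san = board.san(move)
        board.push(move)
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        score = info["score"].white()                   # evaluation from White's point of view
        board.pop()
        text.append(f"After {san}: engine evaluation {score} for White.")
    return "\n".join(text)

if __name__ == "__main__":
    eng = chess.engine.SimpleEngine.popen_uci("stockfish")
    try:
        print(wordify(chess.Board(), eng))
    finally:
        eng.quit()
```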

Exercise

How would you wordify reasoning for this example?

(The board has to be described first. It is impossible to say, for example, that one piece blocks another until it is logically deduced by analysing the square coordinates.)

bought Ṁ100 YES

If you're given 3 seconds to make a move, your reasoning for each move is probably "It looks right". There is a wide distribution of ability in human players for such fast games.

Consider whether you could give the same kind of "current strategic function of each word" analysis for writing a sentence.

Just as LLMs can write working code without really explaining why/how it works (or why it's idiomatic), they should be able to play winning games while being comparatively weak at explaining their moves.

Lichess game data is open, and it's very easy to generate datasets for this; a self-play setup is possible, and there is probably some generalisation benefit from an LLM being good at chess that makes chess likely to end up in training corpora. It has also become somewhat of a benchmark, and benchmarks tend to get Goodharted.
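
For example, something along these lines (a rough sketch assuming python-chess and a downloaded Lichess PGN dump; the file name is just a placeholder) would turn the open game data into (move-prefix, next-move) training pairs:

```python
# Hedged sketch: build (prefix -> next move) training pairs from a Lichess PGN dump.
# Assumes python-chess is installed; "lichess_db.pgn" is a placeholder file name.
import chess.pgn

def pgn_to_pairs(path: str, limit: int = 1000):
    pairs = []
    with open(path, encoding="utf-8") as handle:
        while len(pairs) < limit:
            game = chess.pgn.read_game(handle)
            if game is None:                      # end of file
                break
            board = game.board()
            moves_so_far = []
            for move in game.mainline_moves():
                san = board.san(move)
                pairs.append((" ".join(moves_so_far), san))   # prefix -> next move
                moves_so_far.append(san)
                board.push(move)
    return pairs

# Example: pairs = pgn_to_pairs("lichess_db.pgn"); each pair becomes one training example.
```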

GPT-4o suggests trading rooks and thinks it forces mate (it does not). Stockfish agrees this is the best move, but just uses it to trade down material and harass the king.

GPT-4o:

In the given position, it is White's turn to move. Let's analyze the position:

1. Material: Both sides appear to have equal material. Each side has a queen, a rook, a bishop, and several pawns.

2. King Safety: White's king is on c1 and is relatively safe, although it is somewhat exposed due to the pawn structure. Black's king is castled on the queenside but lacks pawn cover on the a-file.

3. Piece Activity:

- White's queen is on a4, aiming at the weak a6 pawn.

- White's rook on d1 controls the open d-file.

- White's bishop on e3 and knight on f3 are both well-placed.

- Black's queen on e6 and bishop on c4 are active, but Black's rook on h8 and bishop on f8 are still passive.

Given these observations, White has a significant tactical opportunity with the move:

1. Rxd8+ Kxd8

After this exchange, White can further take advantage by playing:

2. Rd1+ Kc8 (forced)

Now, White has a decisive move:

3. Qd7+ Kb8

This sequence of moves leads to a winning position for White because the Black king is forced into a very vulnerable position. The final move after this sequence could be:

4. Qd8+

This move leads to a checkmate.

Therefore, the best move for White in this position is:

1. Rxd8+

It does exactly what I said about chess books: it named (tried to guess) a single line and discussed only that, instead of doing any actual reasoning and choosing among candidates.

To prove that f(Y) is the maximum, one needs to show that f(X) < f(Y) for every X != Y. That is what both people (at least above 1500 Elo in classical games) and Stockfish do, but LLMs have not learned it yet.
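
That "compare against every alternative" step is essentially what a depth-limited search does. A toy sketch of it (assuming python-chess, with a crude material count that is nowhere near Stockfish's evaluation):

```python
# Toy sketch of "check every branch": depth-limited negamax over all legal moves with a
# crude material count. Assumes python-chess; this is nowhere near Stockfish strength.
import chess

VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
          chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board) -> int:
    # Material balance from the point of view of the side to move.
    return sum(VALUES[p.piece_type] * (1 if p.color == board.turn else -1)
               for p in board.piece_map().values())

def negamax(board: chess.Board, depth: int) -> int:
    if board.is_checkmate():
        return -10**6                       # the side to move is mated
    if depth == 0 or board.is_game_over():
        return material(board)
    best = -10**9
    for move in board.legal_moves:          # every alternative X is examined, none skipped
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
    def score(move: chess.Move) -> int:
        board.push(move)
        value = -negamax(board, depth - 1)
        board.pop()
        return value
    return max(board.legal_moves, key=score)
```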

My intuition is that the computational gap between not understanding the rules of chess at all and understanding them is much greater than the gap between playing poorly and playing at the highest level.

LLMs can sort of reason, not efficiently or in the same way as humans perhaps, but there is more than one way to skin a cat. If you think they can be made to consistently follow the rules and basic strategies, following deeper strategic rules shouldn't be that far behind.

That being said, the uncertainty over consistent rule-following, and over whether anyone will have the will and ability to build an LLM with a memory architecture suited to this task by 2028, makes this market fairly priced imo. But in principle, if an LLM is powerful enough, I don't see why it couldn't play chess very well.

The uncertainty in this market is not about whether an LLM as we know them today will be better than super grandmasters (elo > 2700), but some combination of:

  • Will a significantly changed AI that can more plausibly do this (e.g., with some sort of deep search—whether done in "text" or another format) be known as an LLM?

  • Will there be enough games played between such AIs and SGMs that this will happen once by chance?

I highly recommend all the yes bettors here actually play a game against current sota models.

Mine went like this: it's nowhere near 1800, it's well under 1000. This is the Gemini 1.5 Pro API. It tried 10 illegal moves; the game ended because it resigned. It can only play the opening with any sort of competence, and that might give a decent impression, but the second you're out of theory, it's blunders all the way down.
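
If anyone wants to reproduce this kind of test, here's a rough sketch of the harness I have in mind; `ask_model` and `my_move` are placeholder callables (hypothetical, not any particular vendor's API), and python-chess does the legality checking and illegal-move counting.

```python
# Hedged sketch of testing an LLM at chess: every suggested move is validated with
# python-chess and illegal suggestions are counted.
# `ask_model(prompt) -> str` and `my_move(board) -> chess.Move` are placeholder callables.
import chess

def play_vs_model(ask_model, my_move, max_plies: int = 200):
    board, san_moves, illegal = chess.Board(), [], 0
    while not board.is_game_over() and len(san_moves) < max_plies:
        if board.turn == chess.WHITE:                    # the model plays White here
            prompt = ("We are playing chess. Moves so far: " + " ".join(san_moves)
                      + "\nReply with your next move in SAN only.")
            try:
                move = board.parse_san(ask_model(prompt).strip())
            except ValueError:
                illegal += 1                             # illegal or garbled suggestion
                if illegal > 20:
                    break                                # give up if it can't play legally
                continue
        else:
            move = my_move(board)                        # e.g. parse your move from input()
        san_moves.append(board.san(move))
        board.push(move)
    return board.result(), illegal
```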

yeah, but I think most YES bettors are thinking about where the technology will be in late 2028, not where it is now. AI seems to be advancing at a pretty impressive rate right now, I don't think it's unreasonable to assume it could play chess at a decently high level in a few years. The YES bettors who do think it could happen with current sota models seem to at least be talking about fine-tuning or prompt engineering or other tricks that it seems like you didn't use. I don't think there's any confusion about its current competence.

There's a massive difference between a "decently high level" and "grandmaster" level

For humans, there's a massive difference between a "decently high level" and "grandmaster" level. For AI, no. It took less than 4 years for AI to go from near zero to grandmaster level in every complex game, like Go or poker. I don't understand why it would be different for LLMs.

I highly recommend playing against GPT-3.5 Turbo Instruct. It's better at chess than recent models.

https://nicholas.carlini.com/writing/2023/chess-llm.html

I challenged it and waited for 15 minutes; nothing happened. Is it still active?

Last game was 9 months ago. Must be inactive.

There really is no reason to assume 3.5 turbo is better than current sota. That's what Carlini had available at the time.

I'm in the middle of making a wrapper along the lines of what he made. Is the whole point the PGN structure? Is there anything else I should be adding?

Sorry Fergus, I only read the article without checking the game 😅

The question is not whether 3.5 Turbo Instruct is better, but why, and it's probably because there were PGN files in the training set. So yes, the PGN structure is the whole point.
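
For what it's worth, my understanding of the Carlini-style wrapper is roughly this (a sketch; the openai completions call and its parameters are my assumption, and the PGN header is just an example): format the game as PGN movetext and let the completion model continue the game record.

```python
# Hedged sketch of the PGN-style prompt trick: format the game as PGN movetext so a
# completion model (here gpt-3.5-turbo-instruct via the openai client's completions
# endpoint; treat the exact call as an assumption) simply continues the game record.
import chess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def next_move_pgn_style(san_moves: list[str]) -> str:
    movetext = ""
    for i, san in enumerate(san_moves):
        movetext += (f"{i // 2 + 1}. " if i % 2 == 0 else "") + san + " "
    prompt = ('[Event "Rated game"]\n[Result "*"]\n\n'
              + movetext
              + (f"{len(san_moves) // 2 + 1}. " if len(san_moves) % 2 == 0 else ""))
    completion = client.completions.create(model="gpt-3.5-turbo-instruct",
                                           prompt=prompt, max_tokens=6, temperature=0)
    return completion.choices[0].text.split()[0]  # first token should be a SAN move

# Example: next_move_pgn_style(["e4", "e5", "Nf3"]) should return Black's reply, e.g. "Nc6".
```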