By when will Kenshin9000 (or anyone else) “defeat all chess bots” using LLMs? (Permanent)
2024, by Election Day: 4%
2025 or earlier: 4%
2026 or earlier: 11%
2027 or earlier: 11%
2028 or earlier: 24%
2029 or earlier: 25%
2030 or earlier: 18%
2040 or earlier: 30%
2050 or earlier: 37%
2100 or earlier: 48%

This market resolves each option NO if its date passes and neither Kenshin9000 nor anyone else has defeated Stockfish with an LLM-based chess engine.

All remaining options resolve YES once an LLM-based engine defeats Stockfish (or whatever the top engine is at that time).
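
The two resolution rules above can be sketched as a small function. This is only an illustration, not the market's official logic: the function name, the `OPEN` label, and the concrete deadlines (5 November 2024 for Election Day, 31 December for each "or earlier" option) are assumptions.

```python
from datetime import date
from typing import Dict, Optional

def resolve_options(deadlines: Dict[str, date],
                    today: date,
                    defeat_date: Optional[date]) -> Dict[str, str]:
    """Return YES / NO / OPEN for each option (hypothetical helper)."""
    results = {}
    for label, deadline in deadlines.items():
        if defeat_date is not None and defeat_date <= deadline:
            # The qualifying defeat happened in time for this option.
            results[label] = "YES"
        elif today > deadline:
            # The deadline passed without a qualifying defeat.
            results[label] = "NO"
        else:
            results[label] = "OPEN"
    return results

# Assumed deadlines; the market text does not spell out exact dates.
DEADLINES = {
    "2024, by Election Day": date(2024, 11, 5),
    "2025 or earlier": date(2025, 12, 31),
    "2030 or earlier": date(2030, 12, 31),
}

print(resolve_options(DEADLINES, date(2027, 6, 1), None))
print(resolve_options(DEADLINES, date(2027, 6, 1), date(2027, 5, 1)))
```

With no defeat by mid-2027, the first two options are NO and the third is still open; a defeat in May 2027 flips only the remaining option to YES.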

My resolution criteria are more strict than Mira’s:

  1. The LLM engine must have a higher Elo rating than the latest Stockfish (or whatever the top engine is at resolution time) at blitz time controls, with 99.9% confidence, and the result must be reproduced by 3+ people.

  2. The LLM engine must not use another chess engine at runtime.
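
One way to read the "99.9% confidence" requirement in criterion 1 is as a one-sided test on head-to-head match results. The market does not say which statistical test applies, so the sketch below is an assumption: it uses a normal approximation to the average score per game, and the function names are made up for illustration. It does not cover the requirement that 3+ people reproduce the result.

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def superiority_confidence(wins: int, draws: int, losses: int) -> float:
    """Approximate confidence that the engine's true expected score is above 0.5.

    Results are from the LLM engine's point of view. Uses a normal
    approximation, which is an assumption and only reasonable for many games.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    if se == 0.0:
        # Degenerate case: every game had the same result.
        return 1.0 if score > 0.5 else (0.5 if score == 0.5 else 0.0)
    z = (score - 0.5) / se
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

conf = superiority_confidence(wins=52, draws=0, losses=48)
print(conf >= 0.999)  # a narrow 52-48 result is nowhere near the 99.9% bar
```

A higher Elo rating corresponds to an expected score above 0.5, so testing the score against 0.5 is equivalent to testing the Elo difference against 0.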

For the purposes of this market, Large Language Models are 100M+ parameter general-purpose generative text models. A fine-tune of an LLM is ok, but the model cannot be solely trained on chess data. An LLM-based engine may use search, but node evaluation must be performed by invoking the LLM on each node (similar to AlphaZero, which is a DNN+search).
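
As a rough sketch of what "search plus LLM node evaluation" could look like, here is a depth-limited negamax whose evaluator is a pluggable callback. Everything here is an assumption for illustration: `stub_llm_evaluate` is a hard-coded stand-in for a real model call, the toy game (take 1 to 3 stones from a pile, taking the last stone wins) replaces chess to keep the example self-contained, and evaluation happens only at leaf nodes, which is a simplification of the rule above.

```python
from typing import Callable, List

calls = {"n": 0}  # counts how often the (stub) evaluator is invoked

def stub_llm_evaluate(pile: int) -> float:
    """Stand-in for an LLM call; score is from the side to move's view.

    Hypothetical: a real engine would prompt the model with the position here.
    """
    calls["n"] += 1
    # Empty pile: the opponent took the last stone, so the side to move lost.
    return -1.0 if pile == 0 else 0.0

def pile_moves(pile: int) -> List[int]:
    return [m for m in (1, 2, 3) if m <= pile]

def pile_apply(pile: int, move: int) -> int:
    return pile - move

def negamax(state, depth: int, legal_moves: Callable, apply_move: Callable,
            evaluate: Callable) -> float:
    moves = legal_moves(state)
    if depth == 0 or not moves:
        # Leaf node: this is where the evaluator (the LLM) is consulted.
        return evaluate(state)
    return max(-negamax(apply_move(state, m), depth - 1,
                        legal_moves, apply_move, evaluate)
               for m in moves)

def best_move(state, depth: int, legal_moves: Callable, apply_move: Callable,
              evaluate: Callable):
    return max(legal_moves(state),
               key=lambda m: -negamax(apply_move(state, m), depth - 1,
                                      legal_moves, apply_move, evaluate))

move = best_move(5, 5, pile_moves, pile_apply, stub_llm_evaluate)
print(move, calls["n"])
```

The call counter makes the cost visible: every leaf reached by the search costs one evaluator call, which is where an LLM-based engine would spend its time budget.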

The LLM engine and Stockfish will run on the same hardware with the same time controls. The testing hardware should be either a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards.


I did some testing with o1, but it fails at pretty simple puzzles.

He's solving the ARC AGI challenge in 2 weeks, so get your bets in:

Regarding "A fine-tune of an LLM is ok, but the model cannot be solely trained on chess data":

I assume that if it's fine-tuned on 99% chess data and 1% something else it still wouldn't count? Do you mean, it cannot be trained on any data significantly biased towards chess?

I understand it as "the model must have/retain substantial natural language capabilities"

So a Gato-style hybrid of chess engine and chatbot would qualify?? That's a much weaker condition than how I understood the intent.

Note that there are other conditions that rule out bundling a chess engine with an LLM. In fact the condition is IMHO quite strict. If you have something that plays chess and is also a language model, you almost certainly can improve chess performance by sacrificing language. So the market requires that a) it is possible to improve chess state of the art with LLMs and b) someone publishes such an LLM before LLM-derived, chess-specialized technology becomes the new state of the art in chess engines (because the comparison is always against the state of the art engine)

Gato is not "bundling". You train a model to do both chess position evaluation and text prediction (e.g. each task makes up half of the training set); it's obviously doable. I guess your interpretation of the question is: can we show an instance where the language ability makes chess ability at least a little better, rather than worse? It's a valid question, but much weaker than what I understood the question to be. It would be nice if the market creator chimed in on this.

@someonec5dd Are you sure it is reasonable to bet "2024, by election day" substantially higher than "2025 or earlier"? Thanks for the free mana though...

brother u aint beatin alpha zero with an llm anytime soon 😭😭😭

Who TF is buying YES on "before election day"? Am I missing some kind of joke?

A) No reason for "before election day" to be higher than "2025 or earlier" and

B) The resolution criteria are very strict - there's very little computation you can do with an LLM on "a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards" in "Blitz time controls".

Also no real reason for non-search methods to beat search at any point, as chess fundamentally is search, but that's a different question...

why tf is anyone betting YES on any of the dates...

opened a Ṁ550 2024, by Election Day YES at 11% order

sell order at 11%

opened a Ṁ25,000 2030 or earlier NO at 26% order

Stockfish already uses a deep neural net to evaluate terminal nodes in the search. But it is specialized for chess. A net trained on internet text will never be competitive with the specialized one.

Actually, Stockfish doesn’t have a deep NN, it uses NNUE, I think with like four layers these days. LeelaChessZero does have a deep net so your point still stands (and it prolly would even if Leela didn’t have a deep net)

But, concept pinning!

Interesting error, maybe gone after my next trade. 😁

bought Ṁ25 2100 or earlier NO

Doing my bit to stop this absurdity

Maybe the question should say "Stockfish", not "all chess bots"

@Daniel_MC Nope, see the description. Each option resolves against the top engine at the time.

Ah my bad

opened a Ṁ500 2024, by Election Day NO at 10% order

@someonec5dd very confident in Kenshin...

No incentive for long term bets anymore, but I think this market should be much lower

@someonec5dd the jig is up, no more free Mana 📉

@jgyou what?

@jgyou lol, that's not the end of free mana. More like the end of loans.
