This market resolves each option NO if its date passes and Kenshin9000 (or anyone else) has not defeated Stockfish with an LLM-based chess engine.
All remaining options resolve YES once an LLM-based engine defeats Stockfish (or the top engine at the time).
My resolution criteria are stricter than Mira's:
The LLM engine must have a higher Elo than the latest Stockfish (or whatever the top engine is at resolution time) at blitz time controls with 99.9% confidence, and the result must be reproduced by 3+ people.
The LLM engine must not use another chess engine at runtime.
For the purposes of this market, Large Language Models are general-purpose generative text models with 100M+ parameters. A fine-tune of an LLM is OK, but the model cannot be trained solely on chess data. An LLM-based engine may use search, but node evaluation must be performed by invoking the LLM on each node (similar to AlphaZero, which combines a DNN with search).
The LLM engine and Stockfish will run on the same hardware with the same time controls. The testing hardware should be either a commodity desktop or equivalent to the TCEC or another popular computer-chess tournament standard.
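The 99.9% confidence requirement can be checked directly from match results. A rough sketch, assuming a normal approximation to the per-game score (the function names and the one-sided z ≈ 3.09 threshold are my illustrative choices, not part of the market's rules):

```python
import math

def elo_diff(p):
    # Logistic Elo model: convert a score fraction p to an Elo difference.
    return -400 * math.log10(1 / p - 1)

def significant_at_999(wins, draws, losses):
    """Return (estimated Elo diff, 99.9%-confidence lower bound).

    Normal approximation on the per-game score; one-sided z ~= 3.09.
    The lower bound must be above 0 for the LLM engine to qualify.
    """
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    # Per-game score variance (wins score 1, draws 0.5, losses 0).
    var = (wins * (1 - p) ** 2 + draws * (0.5 - p) ** 2 + losses * p ** 2) / n
    lower = p - 3.09 * math.sqrt(var / n)
    return elo_diff(p), (elo_diff(lower) if 0 < lower < 1 else float("-inf"))
```

For example, +600 =300 -100 over 1000 games is roughly +191 Elo with a lower bound well above zero, while an even 100-100 score with no draws is nowhere near significant.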
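The allowed architecture (search is permitted, but every node must be evaluated by the LLM) can be sketched as follows. This is a toy illustration, not a real engine: `llm_score`, the hand-built game tree, and the negamax framing are stand-ins for a real model and a real move generator.

```python
# Toy game tree: node -> list of child nodes (empty list = terminal).
TREE = {"root": ["a", "b"], "a": [], "b": []}
SCORES = {"root": 0, "a": 3, "b": -1}

def llm_score(position):
    # Stand-in for invoking the LLM on the position. A qualifying engine
    # would prompt a general-purpose model for a score here; a table
    # lookup is used only so the sketch runs.
    return SCORES[position]

def negamax(position, depth, side=1):
    children = TREE[position]
    if depth == 0 or not children:
        # Evaluation must come from the (stubbed) LLM call, never from a
        # handcrafted non-LLM evaluation function.
        return side * llm_score(position)
    return max(-negamax(child, depth - 1, -side) for child in children)
```

The point of the sketch is the placement of `llm_score`: the search skeleton is ordinary, but the only evaluation function it ever calls is the LLM.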
@Weezing An LLM engine is a chess engine which uses an LLM for node evaluation. It may still use search, but can’t use a non-LLM evaluation function.
@Paul And what is an LLM in this context? Can it be trained just on chess-specific text (for example, chess notation)? Or only generic text?
@Weezing Great question. I’d say that it has to be a /language/ model, meaning general-purpose, not chess-only training. A fine-tune of a general-purpose language model is fine, but a chess-only transformer model is not.
@Paul Wait a sec, this is completely different from what I thought the market was about when I bet! I thought we were betting on whether an LLM by itself could defeat Stockfish, not a search engine that uses an LLM just for node eval. I wouldn't think of that as an LLM engine.
Like, take AlphaGo as an example: it uses a neural net to direct the Monte Carlo tree search, so it's like half a neural net engine, the other half being the Monte Carlo tree search, which is also crucial to its success. I think calling AlphaGo a "neural network engine" would still be misleading. But an engine using an LLM just for node eval is far less an LLM engine than AlphaGo is a neural network engine.
Also, what's stopping someone from just running the LLM engine with a ton more compute = more depth than Stockfish and "winning" that way? Are you requiring that they use the same amount of compute?
Btw I think the question of whether LLM+search can beat Stockfish is much more interesting (because it's more plausible to actually happen), I just think it's extremely unclear from the question description.
@jack thanks for the feedback. I have updated the description to clarify the engine definitions and hardware/timing constraints.