I will calculate an Elo score for my model and resolve to it by linearly interpolating between the surrounding entries.
Assume that Other has points at every 500 Elo points. I'll split it off if people really think it'll get high scores.
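As a rough illustration of that interpolation, here is a minimal sketch. It assumes the entries sit at exact multiples of 500 Elo; the function name `interpolate_resolution` is made up for this example and the actual answer list in the market may differ.

```python
import math

def interpolate_resolution(elo, step=500):
    """Split a measured Elo between the two nearest entries (assumed multiples of `step`)."""
    lower = math.floor(elo / step) * step
    upper = lower + step
    weight_upper = (elo - lower) / step
    if weight_upper == 0:
        # Landed exactly on an entry, so it takes the full weight.
        return {lower: 1.0}
    return {lower: 1.0 - weight_upper, upper: weight_upper}
```

Under these assumptions a measured Elo of 1230 would put roughly 54% on the 1000 entry and 46% on the 1500 entry.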
Architecture will be a simple transformer model, but I would put most of my effort into the data/reward. Curriculum learning with chess puzzles, reinforcement learning, self-play tournaments, etc.
Validation Elo will be calculated by playing random matches against a population of Stockfish settings at varying Elos until there have been 50 games since its previous all-time high. Stockfish has a "UCI_Elo" configuration that will likely be used. The average of the 50 games following the all-time high will be used to resolve this market.
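One way to read the stopping rule is sketched below. It assumes a rating estimate is recorded after every game and that "the average of the 50 games" means the mean of those recorded estimates; both are my assumptions for illustration, and `resolve_from_history` is a hypothetical helper.

```python
def resolve_from_history(ratings, window=50):
    """Mean rating over the `window` games after the all-time high.

    `ratings[i]` is the (assumed) rating estimate after game i.
    Returns None while the stopping condition has not been reached.
    """
    best_idx = None
    for i, rating in enumerate(ratings):
        if best_idx is None or rating > ratings[best_idx]:
            # New all-time high: the 50-game countdown restarts here.
            best_idx = i
        elif i - best_idx >= window:
            tail = ratings[best_idx + 1 : best_idx + 1 + window]
            return sum(tail) / window
    return None
```

Ties with the previous high do not restart the countdown in this sketch; that detail is a choice, not something the rule above specifies.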
I am allowed to do legality checking. If my model makes fewer than 5% illegal moves, I would likely do legality checking for it (resampling once or twice) so I can test its play at higher ratings. But if it generates a higher rate of illegal moves, those games will count as losses.
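The resampling idea could look something like the sketch below. `sample_move` stands in for whatever produces a move from the model and `legal_moves` for the set of legal moves in the position; both are placeholders, not part of any real library.

```python
def pick_move(sample_move, legal_moves, max_resamples=2):
    """Sample a move, retrying up to `max_resamples` times if it is illegal.

    Returns (move, illegal_count). `move` is None when every attempt was
    illegal, which the caller would score as a loss.
    """
    illegal = 0
    for _ in range(1 + max_resamples):
        move = sample_move()
        if move in legal_moves:
            return move, illegal
        illegal += 1
    return None, illegal
```

Summing `illegal_count` over all moves played gives the illegal-move rate to compare against the 5% threshold.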
@Mira won't trade in this market, and will sell at market if I accidentally buy some shares.
@Butanium Mira's account is deleted now, so I think the mods may just N/A this if we don't hear anything.
I'm canceling a bunch of personal markets because real money trading with the pivot means they aren't really suitable.
But this one has a reasonable number of traders and is decently defined, so I'll leave it alone for now.
As a project update: I'm doing LLM agent stuff for now, so I haven't started this. Remember that it does resolve 0 if I don't do it, unless the admins cancel it. I still intend to do it: It's not that much setup.
https://arxiv.org/abs/2402.04494
TLDR:
... we train a 270M parameter transformer model [on chess]
Lichess blitz Elo of 2895 against humans
https://fxtwitter.com/a_karvonen/status/1743666230127411389?t=cJ8a04FFA9yZzRDCndYIBw&s=19
A 50M-parameter GPT has a 50% win rate against Stockfish 1500
This guy claims a single dense layer can get a 2k Lichess rating (not Elo):
https://github.com/thomasahle/fastchess
He's adding MCTS, though the model itself, with no search, is claimed to be usable.
Very neat idea! I suggest you tie this to a specific method of Elo calculation (including that of the number of games played), otherwise the result is very ill defined.
In particular, I consider the suggestion of random moving opponent a poor choice. The Elo of that opponent is very low (many SDs away from that expected from an agent trained on existing PGN databases), and also estimatable only with large uncertainty. Human players are far from ideal, too. My suggestion is to use Stockfish at low skill setting (but with large hash and long thinking time, so that the nominal Elo is actually achieved), e.g. set to around 1000 Elo or so.
@Zozo001CoN Good suggestions. I'll use a population of Stockfishes set to various Elos, will play until the all-time high Elo of my model stops improving (for 50 games), and will take the average of the 50 games following that all-time high. I'll randomly generate opponents within a 250 Elo range centered at its current rating.
Most free online chess sites use Glicko, not Elo, so my fallback of playing some online games for a rating might not work so cleanly. Some possibilities:
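A sketch of the opponent sampling, assuming "a 250 Elo range centered at its current rating" means plus or minus 125 points. The clamp bounds `lo` and `hi` are placeholders; check the range your Stockfish build reports for `UCI_Elo`. The helper names are hypothetical.

```python
import random

def sample_opponent_elo(current_elo, width=250, lo=1320, hi=3190, rng=random):
    """Draw an opponent Elo uniformly from a window centered on `current_elo`."""
    half = width / 2
    elo = rng.uniform(current_elo - half, current_elo + half)
    # Clamp to the range the engine accepts (assumed bounds, see lead-in).
    return int(round(min(max(elo, lo), hi)))

def uci_strength_commands(elo):
    """UCI commands that ask Stockfish to play at a limited strength."""
    return [
        "setoption name UCI_LimitStrength value true",
        f"setoption name UCI_Elo value {elo}",
    ]
```

Note that a model rated below the engine's minimum would always draw the weakest setting under this clamp, so its rating would need another anchor.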
Elo vs. the agent that picks a move uniformly at random
Win rate vs. an engine + settings with a tested Elo (calibrated against chess engines)
Win rate vs. an engine + settings with a tested Elo (calibrated against human players)
Glicko/Glicko 2 converted to approximate Elo
Win rate relative to myself
I might report on all of these as needed. For training purposes, the "Elo vs. the agent that moves randomly" would likely be my main metric. This is mathematically a clean metric, but traders here probably prefer to bet on human-calibrated scores.
For resolving, assuming it's not 0, I'll likely find several engines + settings with Elos calibrated against human players closest to my model, play them randomly, and calculate an Elo while keeping the other engines fixed. There's also an "implied Elo" if I calculate a win rate vs. a known Elo, that I can report for each engine in the population.
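The "implied Elo" mentioned above follows from the standard Elo expected-score formula. A minimal sketch, assuming `score` is the average result per game with draws counted as half a point:

```python
import math

def expected_score(rating, opponent):
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))

def implied_elo(score, opponent):
    """Invert the expected-score formula to get the rating implied by `score`."""
    if not 0.0 < score < 1.0:
        # 0% or 100% scores imply an unbounded rating difference.
        raise ValueError("score must be strictly between 0 and 1")
    return opponent + 400 * math.log10(score / (1.0 - score))
```

A 50% score against an engine set to 1500 implies 1500. The estimate gets noisy as the score approaches 0 or 1, which is one reason to keep opponents close to the model's current rating.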
@Mira > Most free online chess sites use Glicko not Elo
With respect to the online bots, it is important to note (even besides them not using the standard Elo as you noted) that their self-proclaimed ratings are unverified AND likely overstated.
A particularly notable example in this context is the 'gpt35-turbo-instruct' bot on Lichess, boasting a 2350 "provisional" blitz rating with an actual strength likely below 1800 Elo.
@Mira If it makes an illegal move does it lose automatically or is there some tolerance?
@Weezing Some kind of explicit reward signal for making legal moves will be part of the training, so hopefully that's unlikely. I might sample it a few times until I get a legal move if it's a rare event (<5%), but if it's producing illegal moves all the time I'll just count them as losses and resolve 0. It won't be able to rely on error-correction in the normal course of playing.
@Fern For "grammar constraint": It won't be a typical LLM. It will be a transformer network. Could be restricted to FEN strings, but could also be purely in embedding space with something that knows how to encode entire chess boards into embedding vectors. I haven't decided how I want to represent the choice of move, and some representations make illegal moves much less likely.
@Mira I think you could have a chess engine running alongside of it and construct a 'plausible move grammar' from that, at least during training.
To prevent a combinatorial explosion in the tokenizer, you can have the source and dest locations be a token tuple. Sometimes this is more trouble than it's worth, though (not sure which is best here, honestly).
One could also have a single token for every (source, destination) pair in the 64x64 space, but your vocabulary would then need at least 4096 entries to avoid collisions.
You could also have a supervised legality loss that provides a strong signal as to which moves are legal each turn. This could be especially effective under a 64*64 tokenized scheme. It is basically 'free data' for the network each step, and would be much more efficient than a setup where the network has to learn what is legal implicitly from the moves alone.
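To make the 64*64 scheme concrete, here is a small sketch of the move indexing and the per-turn legality target it would give. It assumes moves arrive as UCI strings like "e2e4" and ignores the promotion piece, so "e7e8q" and "e7e8n" share a token; all function names are made up for this example.

```python
def square_index(name):
    """Map a square like 'e2' to 0..63 (a1 = 0, h8 = 63)."""
    file = ord(name[0]) - ord("a")
    rank = int(name[1]) - 1
    return rank * 8 + file

def square_name(index):
    """Inverse of square_index."""
    return chr(ord("a") + index % 8) + str(index // 8 + 1)

def move_token(uci):
    """Map a UCI move to one of 64*64 = 4096 tokens (promotion piece dropped)."""
    return square_index(uci[:2]) * 64 + square_index(uci[2:4])

def token_to_uci(token):
    """Inverse of move_token for non-promotion moves."""
    src, dst = divmod(token, 64)
    return square_name(src) + square_name(dst)

def legality_target(legal_uci_moves):
    """Multi-hot vector over the 4096 tokens: 1.0 where the move is legal."""
    target = [0.0] * 4096
    for uci in legal_uci_moves:
        target[move_token(uci)] = 1.0
    return target
```

The target vector could feed a per-token binary cross-entropy term alongside the move-prediction loss, with the legal move list coming from an engine or move generator during training.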