Did kenshin9000 beat the Stockfish chess engine with GPT-4 by end of January?


resolved Jan 30


Resolves YES if kenshin9000 releases code and it decisively beats Stockfish (ver.16). NO if it's end of January and such superiority hasn't been demonstrated.

For this market, it has to run without calling chess engines (or equivalent); game-mechanics support libraries like python-chess are allowed, but only to the extent that neither move suggestions nor evaluations come from anywhere but GPT-4.

The reference Stockfish opponent shall play under CCRL 40/15 testing conditions, on 1 CPU with 256 MB hash size, but without endgame tablebases:
https://www.computerchess.org.uk/ccrl/4040/about.html

Superiority is defined as either 55% or +2 net wins, whichever is higher, in a set of games with at least 10 decisive (i.e. non-draw) results.
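To make the threshold concrete, here is a minimal sketch of the check under one possible reading of the criterion. It assumes that "55%" means the score percentage over all games (a draw counting as half a point) and that "whichever is higher" means both bars must be cleared. The function name `is_superior` and its parameters are hypothetical and not part of the market rules.

```python
def is_superior(wins: int, losses: int, draws: int = 0) -> bool:
    """Check the superiority criterion under the assumptions stated above.

    wins/losses/draws are counted from the challenger's point of view.
    """
    decisive = wins + losses
    total = decisive + draws

    # The set must contain at least 10 decisive (non-draw) games.
    if decisive < 10:
        return False

    # Bar 1: at least +2 net wins.
    net_ok = (wins - losses) >= 2

    # Bar 2 (assumed reading): score of at least 55% over all games,
    # i.e. (wins + draws / 2) / total >= 0.55, kept in integer arithmetic
    # to avoid floating-point edge cases.
    score_ok = 20 * wins + 10 * draws >= 11 * total

    # "Whichever is higher" is read here as: the stricter bar decides,
    # so both must hold.
    return net_ok and score_ok


print(is_superior(6, 4))       # 6 wins, 4 losses, no draws
print(is_superior(11, 9, 80))  # +2 net wins, but only a 51% score
```

Under this reading, a 6-4 result with no draws qualifies, while +2 net wins buried in 80 draws does not, because the score stays at 51%. If "55%" were instead meant as the share of decisive games only, the second bar would change accordingly.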

NOTE: the criteria may be revised so that verification can be done for a reasonable amount of real money.


See his latest postponement.

Note that kenshin9000_ is now claiming to have reached "just about 3800" Elo with Llama2-70B, and expects ~3900 vs SF16 with GPT-4 (but he has yet to deliver anything).

See my post in a sister market, with some hilarious background info.

NOTE: I may have to revise the resolution criteria, as kenshin9000_ is now quoting an excessively high cost of running his engine code (>$50/game!). Unfortunately, given the vagueness of his proclamations, this cannot yet be pinned down. I'll be glad to receive suggestions from market participants!

An interesting tidbit from the latest xeet by kenshin9000_:
"The evaluation function I currently have was updated through ~11000 games"

This really is a minuscule training set, especially for such a contraption as the supposedly NLP-assisted weight determination his scheme ostensibly uses. For comparison, the main repository of test games for Stockfish has close to 7 billion (yes, with a B) games.