Currently Llama-4-Maverick sits at #2 on the Arena leaderboard (https://lmarena.ai/?leaderboard). However, overall feedback about Maverick does not reflect this, and in the released examples (see: https://x.com/lmarena_ai/status/1909397817434816562) it is sometimes baffling how Maverick got chosen. There are accusations of cheating.
So, did they outright cheat, rather than merely Goodhart?
Resolves to YES if at any point I am 95%+ confident that Meta cheated, or someone cheated on their behalf, in a way that impacted their ordinal ranking.
Resolves to NO if at any point I am 95%+ confident that Meta did NOT cheat, and no one else cheated on their behalf, in a way that impacted their ordinal ranking.
If neither occurs within a year, this resolves to my probability that this was the result of cheating, with a strong prior towards fair market price. If this market gets big enough to matter and a better resolution mechanism with the same goal is suggested, I might switch to a different rule here prior to 7/1/25.
WARNING: SUBJECTIVE EVALUATION MARKET if evidence is not definitive. I don't know any other way to offer this market, and I WILL NOT BE ARGUING ABOUT THAT unless someone wants to pay my hourly (hint: don't do that).
Update 2025-04-08 (PST) (AI summary of creator comment): Use of a different model version is not, by itself, sufficient to count as cheating.
If Meta used a different version of the model solely for Arena purposes, that alone does not meet the bar for cheating.
There must be additional evidence of misconduct, beyond the use of a different model version, that affected their ranking.
Resolution will require evidence of more than just a version change before concluding that cheating occurred.
If Meta used a specific model for the arena which is different from the model they used for their other benchmarks, would that count as cheating?
(This is the situation, I think. Ref: https://www.youtube.com/watch?v=tjJxaqKIk9w )
@YonatanCale If they used a different version of the model, that alone is NOT the bar for cheating here - I want to know if MORE than that happened.
@ZviMowshowitz Good thing Yonatan checked. This is what it seems like they did, and I would absolutely call this lying and cheating on lmsys. I guess you'd also have to think they did something more, like swinging the score with a bunch of bots or mturkers, or literally falsifying the data with a cyberattack, to go beyond "they said they were testing Model A, but were actually testing an importantly different Model B designed specifically to perform well on just this test."
@NathanHelmBurger There are explicit claims they used watermarks to have users identify and rate outputs, yes. That's what I was interested in.