Currently Llama-4-Maverick sits at #2 on the Arena leaderboard (https://lmarena.ai/?leaderboard). However, overall feedback about Maverick does not reflect this, and in the released examples (see: https://x.com/lmarena_ai/status/1909397817434816562) it is sometimes baffling how Maverick got chosen. There are accusations of cheating.
So, did they outright cheat, rather than merely Goodhart?
Resolves to YES if at any point I am 95%+ confident that Meta cheated, or someone cheated on their behalf, in a way that impacted their ordinal ranking.
Resolves to NO if at any point I am 95%+ confident that Meta did NOT cheat, and no one else cheated on their behalf, in a way that impacted their ordinal ranking.
If neither occurs within a year, this resolves to my probability that this was the result of cheating, with a strong prior towards fair market price. If this market gets big enough to matter and a better resolution mechanism with the same goal is suggested, I might switch to a different rule here prior to 7/1/25.
WARNING: SUBJECTIVE EVALUATION MARKET if evidence is not definitive. I don't know any other way to offer this market, and I WILL NOT BE ARGUING ABOUT THAT unless someone wants to pay my hourly (hint: don't do that).
Update 2025-04-08 (PST) (AI summary of creator comment): Use of a different model version is not, by itself, sufficient to count as cheating.
If Meta used a different version of the model solely for Arena purposes, that alone does not meet the bar for cheating.
There must be additional evidence of misconduct, beyond the use of a different model version, that affected their ranking.
Resolution will require evidence of more than just a version change before concluding that cheating occurred.
If Meta used a specific model for the arena which is different from the model they used for their other benchmarks, would that count as cheating?
(This is the situation, I think. Ref: https://www.youtube.com/watch?v=tjJxaqKIk9w )
@YonatanCale If they used a different version of the model, that alone is NOT the bar for cheating here - I want to know if MORE than that happened.
@ZviMowshowitz Good thing Yonatan checked. This is what it seems like they did, and I would absolutely call this lying and cheating on lmsys. I guess you'd also have to think they did something more, like swinging the score with a bunch of bots or mturkers, or literally falsifying the data with a cyberattack, to go beyond "they said they were testing Model A, but were actually testing an importantly different Model B designed specifically to perform well on just this test."
@NathanHelmBurger There are explicit claims they used watermarks to have users identify and rate outputs, yes. That's what I was interested in.