MANIFOLD
Meta cheated at LM Arena to pump Llama-4's score?
Resolved NO on Apr 9

Currently Llama-4-Maverick is sitting at #2 in Arena (https://lmarena.ai/?leaderboard). However, overall feedback about Maverick does not reflect this, and the examples released (see: https://x.com/lmarena_ai/status/1909397817434816562 ) are sometimes baffling as to how Maverick got chosen. There are accusations of cheating.

So, did they outright cheat, rather than merely Goodhart?

Resolves to YES if at any point I am 95%+ confident that Meta cheated, or someone cheated on their behalf, in a way that impacted their ordinal ranking.

Resolves to NO if at any point I am 95%+ confident that Meta did NOT cheat, and no one else cheated on their behalf, in a way that impacted their ordinal ranking.

If neither occurs within a year, this resolves to my probability that this was the result of cheating, with a strong prior towards fair market price. If this market gets big enough to care and a better resolution mechanism with the same goal is suggested, I might switch to a different rule here prior to 7/1/25.

WARNING: SUBJECTIVE EVALUATION MARKET if evidence is not definitive. I don't know any other way to offer this market, and I WILL NOT BE ARGUING ABOUT THAT unless someone wants to pay my hourly (hint: don't do that).

  • Update 2025-04-08 (PST) (AI summary of creator comment): Using a different model version is not, by itself, sufficient to count as cheating.

    • If Meta used a different version of the model solely for the arena purposes, that does not meet the bar for cheating.

    • There must be additional evidence of misconduct beyond using a different model version that affected their ranking.

    • The resolution will require that more than just a version change be evident before concluding that cheating occurred.

  • Update 2026-04-08 (PST) (AI summary of creator comment): The creator indicates this market is likely to resolve NO. Meta has confessed to using different model versions for Arena, but per the resolution criteria, using a different model version alone does not constitute cheating. Resolution will remain NO unless a strong argument is made for why it should be otherwise.


🏅 Top traders

#  Trader  Total profit
1  Ṁ783
2  Ṁ80
3  Ṁ75
4  Ṁ68
5  Ṁ65

My understanding of the situation here is that they have confessed to using different models, but under the market criteria that alone does not count as cheating. So this is going to resolve NO unless someone comes up with a strong argument why it shouldn't.

@ZviMowshowitz how is using different models not equivalent to gaming the system, i.e. cheating?

@Hakari I already decided last year that this would not constitute sufficient grounds, see the description of the market.

@ZviMowshowitz oh well.. I should have read the comments..

sold Ṁ58 YES

Well, it seems like they used a different version. The HF version has now been added and it's around 150 Elo lower, in 32nd place. I think this is cheating, but it doesn't fit the market criteria. I don't think they needed other ways to cheat; a specially tuned model is probably enough.


bought Ṁ3,000 NO

I'll bet a lot on NO rn if anyone wants to bet larger amounts

If Meta used a specific model for the arena which is different from the model they used for their other benchmarks, would that count as cheating?

(This is the situation, I think. Ref: https://www.youtube.com/watch?v=tjJxaqKIk9w )

@YonatanCale If they used a different version of the model that alone is NOT the bar for cheating here - I want to know if MORE than that happened.

@ZviMowshowitz Good thing Yonatan checked. This is what it seems like they did, and I would absolutely call this lying and cheating on lmsys. I guess that to go beyond "they said they were testing Model A, but actually were testing importantly different Model B designed specifically to perform well on just this test," you'd also have to think they did something like swing the score with a bunch of bots or mturkers, or literally falsified the data with a cyberattack.

@NathanHelmBurger There are explicit claims they used watermarks to have users identify and rate outputs, yes. That's what I was interested in.

What do you think right now?