resolved May 19

This market will resolve YES if, by the end of June 2024, Elon Musk's xAI announces that they have a language model at least as powerful as GPT-3.5 or Claude.

By default, I will use the Arena Elo rating to decide whether a model meets the bar. If there is no such rating, I will use other benchmarks (e.g., MMLU) or my subjective impression. If there is a lot of disagreement, I will resolve NA.
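A minimal sketch of the resolution rule above, for readers who want to see the order of the checks. The function name, parameter names, and the example ratings are all hypothetical. Checking disagreement first is an assumption, since the description does not say whether the NA clause also applies when an Arena Elo rating exists.

```python
from typing import Optional


def resolve_market(candidate_elo: Optional[float],
                   baseline_elo: Optional[float],
                   fallback_meets_bar: Optional[bool] = None,
                   heavy_disagreement: bool = False) -> str:
    """Return "YES", "NO", or "NA" following the stated resolution rule.

    candidate_elo / baseline_elo: Arena Elo ratings, or None if unrated.
    fallback_meets_bar: verdict from other benchmarks (e.g., MMLU) or a
        subjective impression, used only when no Elo comparison exists.
    heavy_disagreement: True if there is a lot of disagreement.
    """
    # Assumption: disagreement overrides everything else.
    if heavy_disagreement:
        return "NA"
    # Default path: compare Arena Elo ratings ("at least as powerful").
    if candidate_elo is not None and baseline_elo is not None:
        return "YES" if candidate_elo >= baseline_elo else "NO"
    # Fallback path: other benchmarks or subjective impression.
    if fallback_meets_bar is None:
        raise ValueError("no Elo rating and no fallback verdict supplied")
    return "YES" if fallback_meets_bar else "NO"


# Hypothetical ratings, for illustration only.
print(resolve_market(1100.0, 1050.0))
print(resolve_market(None, None, fallback_meets_bar=True))
```

Using `>=` rather than `>` reflects the "at least as powerful" wording of the market.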

Anyone objecting to resolving YES? It seems like it has a good MMLU score, and it doesn't look like it'll appear on the LMSYS Chatbot Arena Leaderboard anytime soon.

sold Ṁ85 NO

@JonasVollmer I was the biggest NO holder; this seems fair

predicted NO

Why do you think it still hasn't been added to the arena? It also hasn't received an independent MMLU evaluation side by side with GPT-3.5. They've already added the latest Mistral model to the arena, even though it came out months later. I'm still a large NO holder (disclosure).

predicted YES

@benshindel Yeah IDK, seems weird

predicted YES

Any reasons against resolving this YES, based on all the benchmarks?

predicted NO

@JonasVollmer It's still not on Chatbot Arena, which is the preferred benchmark. Shouldn't you wait until the market close in case it gets uploaded there? Chatbot Arena scores can be quite different from other benchmarks. I was betting on that.

predicted YES

@Shump Ok, will hold off on resolving YES based on this!

predicted NO

It's apparently more capable by MMLU/GSM8k/MATH/HumanEval, although those are not directly related to how much people like it/arena score.


GPT-3.5 isn't really state of the art anymore. There are open source models that beat it on most metrics.

What has xAI done, besides be announced?

predicted NO

@dominic I been saying this

@benshindel Building products & doing stuff takes time. Consider: There were 2½ years between GPT-3 and ChatGPT

@dominic lol looks like this was not a great take

predicted NO

Another question: Llama-2 seems lower on the leaderboard than GPT-3.5.

Why is the title in disagreement with the description?

predicted YES

@BenjaminShindel updated the description to remove Llama 2 (thought this would be most fair to you given that you're the largest NO holder)

@JonasVollmer People subjectively prefer LLAMA2 over GPT-3.5 by far.

Try out https://llmboxing.com/

predicted YES

@firstuserhere it does worse on the benchmarks I linked to. Not sure why

predicted NO

@JonasVollmer Thx! Although tbh it wouldn’t impact my betting that much as I mostly just think it’s <75% likely they’ll have developed any public LLM at all by June

predicted NO

Is there any evidence that xAI has even begun to train LLMs or that they plan on doing so in the next 9 months?

Subjective quality or on benchmarks, or on leaderboards? Or a general qualitative answer?

If the latter, you may wanna frame the market similar to Peter Wildeford's following ones:

predicted YES

@firstuserhere Added: "By default, I will use the Arena Elo rating to decide whether a model meets the bar. If there is no such rating, I will use other benchmarks (e.g., MMLU) or my subjective impression."

hm, i mean given the list of people + dan advising it, most likely a strong yes, and since timelines are pretty quick june is actually a reasonably solid estimate

i think the hardest roadblock would be time for training + finding good enough data. it could also be the case that they don't actually go towards LMs immediately, which seems pretty low probability (although i would be especially interested in seeing whether they did some autoformalization stuff)

predicted NO

@astyerche now realizing this says as powerful and not more powerful oof