Will there be 20+ LLMs that match or outperform GPT-3.5's performance by the end of 2024?
Dec 31

As of market creation, there are a few, but not 20. Off the top of my head, we have:

  • Mistral Mixtral

  • Inflection-2

  • Anthropic Claude 2

  • Google Gemini Pro

  • Grok

  • GPT-4

Would the following models count:
- Anthropic Claude 1 (outranks GPT-3.5 here)
- Anthropic Claude 2.1
- GPT-4-Turbo

In general, if you define GPT-3.5 as "GPT-3.5-Turbo-1106", then there are already 17 models outranking it on this leaderboard https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

@JonasVollmer, by GPT-3.5, what was meant is the original version, gpt-3.5-turbo-0613, which will be deprecated soon. I'll operationalize the market better.

Would 20 slightly different variations of, e.g., Mixtral satisfy the criteria?

@JuJumper If they're released as separate models, they might count separately (depending on details).

@firstuserhere what counts as a release?

@JuJumper If it's accessible to people not affiliated with the creator of the model, that's a public release

@firstuserhere My suggestion is that separate versions of similar models should not count separately, and you should only count one per series of models (i.e. only the best OpenAI GPT, only the best Gemini model, only the best PaLM model, etc)

Mixtral does not currently outperform GPT-3.5-Turbo on most benchmarks: https://arxiv.org/abs/2312.11444

(Although I'm unsure whether that paper was using the instruction fine-tuned version of Mixtral, which could make a big difference.)

Do the models have to be trained from scratch?

What are the criteria for assessing performance?

@B1e0e Benchmark results

What if they keep updating GPT-3.5-Turbo but keep calling it by the same name?

@VAPOR GPT-4 added to the list

