Will any group develop a GPT-4 competitor of comparable capability (text) by Oct. 2023?
Closes Sep 30

"Comparable capability" refers to text performance only, to keep it simple (the hardest part of making GPT-4 is the language model anyway). To be comparable, performance should be equal or extremely close across a wide range of benchmarks (e.g. MMLU, HumanEval, WinoGrande) and tests (e.g. the SAT). The model should also have a context length of at least 8k tokens (chosen because GPT-4 has 8k and 32k context length versions).

Of course, to qualify as YES, the group that develops a competitor must publicly announce that they trained an LLM, along with its benchmark results.

hyperion sold Ṁ11 of YES

Some analysis of PaLM 2 vs GPT-4. This is hard to benchmark because PaLM 2 Large is not public and we can only rely on the technical report.

At first glance, PaLM 2's results in the technical report appear ~5% below GPT-4 across most standard benchmarks (e.g. MMLU, HellaSwag, WinoGrande). However, Table 5 of the PaLM 2 technical report shows chain-of-thought (CoT) results that are better than or equal to GPT-4's on WinoGrande and DROP. GPT-4 results for StrategyQA and CSQA, sourced from around the internet, suggest that PaLM 2's published results on those benchmarks are slightly better. GPT-4 is notably better at coding, however, judging by the HumanEval results.

The main reason to resolve NO here, however, is that PaLM 2 (as described in the report) is not a chatbot: it is a base model instruction-finetuned on FLAN, not a full RLHF finetune on millions of human preferences.
If we were considering base models only, I think PaLM 2 would have comparable capability. However, considering the full chat capabilities that people care about, I expect PaLM 2 to be slightly worse to use. Therefore I will not consider PaLM 2 sufficient to meet the bar for this question, unless there are well-argued objections.

Felle

I'd like to bet on this market, but I find the question and description too vague.

hyperion predicts YES

@Felle How can the resolution be improved? It seems pretty clear to me, since I explicitly name benchmarks.

Mira bought Ṁ100 of NO

I'm buying more NO because there's only a month left now, but:

"The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens."

"[...] based on Code Llama [...] WizardCoder-34B surpasses (initial) GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass"

So it meets context size and at least one benchmark, though it's limited to code only.

hyperion predicts YES

@Mira What do you think about the published PaLM 2 results, which are competitive? I'm unsure.

Mira predicts NO

@hyperion Google Bard is using that, right? I'm surprised they claim it matches GPT-4: I tested it months ago, was very disappointed, and haven't really thought about it since. As the market requires, they did announce it and they did find some benchmarks, so I would understand if it's technically a YES.

One example of a benchmark where it's worse: it's 200 Elo points below GPT-4 on this leaderboard, and 100 points below GPT-3.5-turbo: https://lmsys.org/blog/2023-05-25-leaderboard/
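
For a rough sense of what those gaps mean: under the standard Elo model, a rating difference maps to an expected head-to-head win rate. This is a minimal sketch, assuming the leaderboard uses the usual base-10, 400-point Elo scale; the function name `elo_expected_score` is illustrative and not taken from the leaderboard's code.

```python
def elo_expected_score(rating_diff: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model, given its rating lead.

    Standard Elo logistic formula; the 400-point scale is an assumption
    about how the leaderboard computes its ratings.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


# Gaps quoted in the comment above: 100 points (vs. GPT-3.5-turbo)
# and 200 points (vs. GPT-4).
for gap in (100, 200):
    print(f"{gap}-point gap -> {elo_expected_score(gap):.0%} expected win rate")
```

Under that assumption, a 200-point lead corresponds to the higher-rated model being preferred in roughly three out of four pairwise comparisons.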

Everyone finetuning language models uses GPT-4 as the gold standard to rate responses and generate synthetic data. I haven't heard of anyone using PaLM 2 for this. This would be similar to a "citation count" metric.

HumanEval is also mentioned in the market description, and there it scores 37.6 vs. GPT-4's 67 at release.
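
One way to make "equal or extremely close" concrete is to flag every shared benchmark where the candidate trails the reference by more than some tolerance. This is only a sketch: the 2-point tolerance `max_gap` and the helper name `trailing_benchmarks` are hypothetical, since the market description does not fix a numeric threshold. The only scores used are the HumanEval figures quoted above.

```python
def trailing_benchmarks(candidate: dict, reference: dict, max_gap: float = 2.0) -> dict:
    """Return {benchmark: gap} for shared benchmarks where the candidate
    trails the reference by more than max_gap points.

    max_gap is a hypothetical tolerance, not part of the market's criteria.
    """
    gaps = {}
    for name in candidate.keys() & reference.keys():
        gap = reference[name] - candidate[name]
        if gap > max_gap:
            gaps[name] = round(gap, 1)
    return gaps


# HumanEval scores as quoted in this thread (GPT-4's is its score at release).
gpt4 = {"HumanEval": 67.0}
candidate = {"HumanEval": 37.6}

print(trailing_benchmarks(candidate, gpt4))
```

An empty result would mean the candidate is within tolerance on every benchmark both models report; any entry is a benchmark where it falls short.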

IMO Anthropic's Claude is the only LLM that comes close to GPT-4. The next version of Claude might be competitive.

hyperion predicts YES

@Mira I don't believe the PaLM 2 API or Bard uses the most capable version described in the paper. The benchmarks in the actual paper are fairly convincing.

ersatz predicts YES

New: In a rare collaboration between DeepMind and Google Brain, software engineers at both Alphabet AI units are working together on a GPT-4 rival that would have up to 1 trillion parameters—a project known internally as Gemini.