Will any group develop a GPT-4 competitor of comparable capability (text) by Oct. 2023?
Resolved NO (Sep 30)

"comparable capability" refers to the text performance only, to keep it simple (the hardest part of making GPT-4 is the language model anyway). To be comparable the performance should be equal or extremely close across a wide range of benchmarks (e.g. MMLU, HumanEval, WinoGrande) and tests (e.g. SAT). It should also have at least 8k context length (chosen since GPT-4 has 8k and 32k context length versions).

Of course, to qualify as YES, the group that develops a competitor must publicly announce that they trained an LLM with the benchmark results.
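
The criteria above lend themselves to a rough mechanical check. A minimal sketch of one way to operationalize "equal or extremely close across a wide range of benchmarks": the GPT-4 numbers are from its technical report, but the 2-point tolerance and the candidate scores are my own illustrative placeholders, not part of the market's criteria.

```python
# Published GPT-4 benchmark scores (GPT-4 technical report).
GPT4_SCORES = {"MMLU": 86.4, "HumanEval": 67.0, "WinoGrande": 87.5}

def is_comparable(candidate: dict, tolerance: float = 2.0) -> bool:
    """True if the candidate is within `tolerance` points of GPT-4 on every benchmark."""
    return all(candidate[name] >= score - tolerance
               for name, score in GPT4_SCORES.items())

# Hypothetical candidate scores, purely for illustration:
print(is_comparable({"MMLU": 85.1, "HumanEval": 66.0, "WinoGrande": 88.2}))  # True
print(is_comparable({"MMLU": 78.3, "HumanEval": 37.6, "WinoGrande": 90.0}))  # False
```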

Really hard to justify resolving YES based on PaLM 2, so it had to be NO. The lessons I learned from judging this market were incorporated into my new market - please go bet there! https://manifold.markets/hyperion/will-google-gemini-perform-better-t

sold Ṁ11 of YES

Some analysis of PaLM 2 vs GPT-4. This is hard to benchmark because PaLM 2 Large is not public and we can only rely on the technical report.

At first glance, PaLM 2's results in the technical report appear roughly 5% below GPT-4's across most standard benchmarks (e.g. MMLU, HellaSwag, WinoGrande). However, Table 5 of the PaLM 2 technical report shows chain-of-thought (CoT) results that are better than or equal to GPT-4 on WinoGrande and DROP. GPT-4 numbers for StrategyQA and CSQA sourced from around the internet suggest that PaLM 2's published results are slightly better there. Judging by the HumanEval results, however, GPT-4 is notably better at coding.

The main reason to resolve NO here, however, is that PaLM 2 (as described in the report) is not a chatbot: it is a base model instruction-finetuned on FLAN, not a model fully RLHF-finetuned on millions of human preferences.
If we were considering base models only, I think PaLM 2 has comparable capability. However, considering the full chat capabilities that people care about, I expect PaLM 2 to be slightly worse to use. Therefore I will not consider PaLM 2 sufficient to meet the bar for this question unless there are well-argued objections.

I'd like to bet on this market but I find the question + description too vague.

predicted YES

@Felle How can the resolution be improved? It seems pretty clear to me, since I explicitly name benchmarks.

bought Ṁ100 of NO

I'm buying more NO because there's only a month left now, but:

"The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens."

"[...] based on Code Llama [...] WizardCoder-34B surpasses (initial) GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass"

So it meets context size and at least one benchmark, though it's limited to code only.
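
For reference, the figures quoted above are pass@1 scores. The HumanEval paper (Chen et al. 2021) defines an unbiased pass@k estimator; a minimal sketch of it below, where the sample counts in the usage line are made up for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct.
    Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable form."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Made-up counts, for illustration: 200 samples, 146 correct => pass@1 = 0.73
print(round(pass_at_k(n=200, c=146, k=1), 2))
```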

predicted YES

@Mira What do you think about PaLM 2's published results, which are competitive? I'm unsure.

predicted NO

@hyperion Google Bard is using that, right? I'm surprised they claim it matches GPT-4: I tested it months ago, was very disappointed, and haven't really thought about it since. As the market requires, they did announce it and they did find some benchmarks, so I would understand if it's technically a YES.

One example of a benchmark where it's worse: it sits about 200 Elo points below GPT-4 on this leaderboard, and about 100 points below GPT-3.5-turbo: https://lmsys.org/blog/2023-05-25-leaderboard/
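
To put those gaps in perspective, the standard Elo model maps a rating difference to an expected head-to-head win rate (this is the generic formula, nothing specific to that leaderboard):

```python
def elo_win_prob(rating_gap: float) -> float:
    """Expected win probability for the higher-rated model under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(f"{elo_win_prob(200):.0%}")  # ~76%: GPT-4 vs. a model 200 points below
print(f"{elo_win_prob(100):.0%}")  # ~64%: GPT-3.5-turbo vs. the same model
```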

Everyone finetuning language models uses GPT-4 as the gold standard to rate responses and generate synthetic data. I haven't heard of anyone using PaLM 2 for this. This would be similar to a "citation count" metric.

HumanEval is also mentioned in the market description, and PaLM 2 scores 37.6 vs. GPT-4's 67 at release.

IMO Anthropic's Claude is the only LLM that comes close to GPT-4. The next version of Claude might be competitive.

predicted YES

@Mira I don't believe the PaLM 2 API or Bard serves the most capable version from the paper. The benchmarks in the paper itself are fairly convincing.

predicted YES

New: In a rare collaboration between DeepMind and Google Brain, software engineers at both Alphabet AI units are working together on a GPT-4 rival that would have up to 1 trillion parameters, a project known internally as Gemini.

https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai