
"comparable capability" refers to the text performance only, to keep it simple (the hardest part of making GPT-4 is the language model anyway). To be comparable the performance should be equal or extremely close across a wide range of benchmarks (e.g. MMLU, HumanEval, WinoGrande) and tests (e.g. SAT). It should also have at least 8k context length (chosen since GPT-4 has 8k and 32k context length versions).
Of course, to qualify as YES, the group that develops a competitor must publicly announce the LLM they trained along with its benchmark results.

Some analysis of PaLM 2 vs GPT-4. This is hard to benchmark because PaLM 2 Large is not public and we can only rely on the technical report.
At first glance, PaLM 2's results in the technical report appear to be ~5% below GPT-4 across most standard benchmarks (MMLU, HellaSwag, WinoGrande, etc.). However, Table 5 of the PaLM 2 technical report shows chain-of-thought (CoT) results that are better than or equal to GPT-4 on WinoGrande and DROP. Sourcing GPT-4 numbers for StrategyQA and CSQA from around the internet suggests that PaLM 2's published results are slightly better. GPT-4 is notably better at coding, however, judging by the HumanEval results.
The main reason to resolve NO here, however, is that PaLM 2 (as described in the report) is not a chatbot: it is a base model instruction-finetuned on FLAN, not a full RLHF finetune on millions of human preferences.
If we were considering base models only, I think that PaLM 2 has comparable capability. However, considering full chat capabilities that people care about, I expect PaLM 2 to be slightly worse to use. Therefore I will not consider PaLM 2 sufficient to meet the bar for this question, unless there are well-argued objections.

I'm buying more NO because there's only a month left now, but:
"The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens."
"[...] based on Code Llama [...] WizardCoder-34B surpasses (initial) GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass"
So it meets the context-length requirement and clears GPT-4 on at least one benchmark, though it's limited to code only.
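
For anyone reading the quoted numbers: HumanEval results like "73.2% pass" are conventionally pass@k (usually pass@1) scores, computed with the unbiased estimator from the original HumanEval paper. A minimal Python sketch (the function name and example counts are mine, purely for illustration):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
    # Language Models Trained on Code").
    # n: completions sampled for a problem, c: completions passing all unit tests,
    # k: evaluation budget (k=1 for the pass@1 figures quoted in this thread).
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples per problem, 31 of them passing
print(pass_at_k(200, 31, 1))   # 0.155 (equals c/n when k=1)
print(pass_at_k(200, 31, 10))  # ~0.82

Averaging this estimate over the 164 HumanEval problems gives the headline score.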

@hyperion Google Bard is using that, right? I'm surprised they claim it matches GPT-4: I tested it months ago, was very disappointed, and haven't really thought about it since. As the market requires, they did announce it and they did find some benchmarks, so I would understand if it's technically a YES.
One example of a benchmark where it's worse: it's 200 Elo points below GPT-4 on this leaderboard, and 100 points below GPT-3.5-turbo: https://lmsys.org/blog/2023-05-25-leaderboard/
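
To put those Elo gaps in perspective: under the standard Elo model, a rating difference maps to an expected head-to-head score (counting ties as half a win). A quick sketch, with a helper name of my own choosing:

def elo_win_prob(gap: float) -> float:
    # Expected score of the higher-rated model under the Elo model,
    # given the rating gap in points.
    return 1.0 / (1.0 + 10 ** (-gap / 400))

print(round(elo_win_prob(200), 2))  # 0.76
print(round(elo_win_prob(100), 2))  # 0.64

So a 200-point gap means GPT-4 would be preferred in roughly three out of four head-to-head comparisons.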
Everyone finetuning language models uses GPT-4 as the gold standard to rate responses and generate synthetic data. I haven't heard of anyone using PaLM 2 for this. This would be similar to a "citation count" metric.
HumanEval is also mentioned in the market description, and PaLM 2 scores 37.6% there vs. GPT-4's 67% at release.
IMO Anthropic's Claude is the only LLM that comes close to GPT-4. The next version of Claude might be competitive.

New: In a rare collaboration between DeepMind and Google Brain, software engineers at both Alphabet AI units are working together on a GPT-4 rival that would have up to 1 trillion parameters—a project known internally as Gemini.