"perform better" refers to the text performance only, to keep it simple. To be comparable the performance should be equal or extremely close across a wide range of benchmarks (e.g. MMLU, HumanEval, WinoGrande) and chat/agent tests (e.g. MT-Bench). It should also have at least 8k context length (chosen since GPT-4 has 8k and 32k context length versions).
Of course, to qualify as YES, the group that develops a competitor must publicly announce benchmark results for the LLM they trained, or make an API available to external evaluators. If Gemini is released exclusively through a chat interface and the only benchmarks are internal to Google, then this market will resolve N/A for lack of sufficient information.
The market will resolve as soon as accurate evaluations of Gemini are available after it releases. The only situation in which this market should reach its end date is if Gemini has not been released to external evaluators by EOY 2024.
GPT-4's reference results will be taken from the GPT-4 API at the time of the Gemini evaluation (i.e. the same month). If GPT-4.5 is released, it will not be considered.
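As a sketch of what "the GPT-4 API at the time of evaluation" would mean in practice, one could pin a dated model snapshot when running the reference queries. The snapshot id and prompt below are placeholders, not the actual evaluation setup.

```python
# Hypothetical sketch: query whatever GPT-4 snapshot the API serves at
# evaluation time. Snapshot id and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4-0613",  # example dated snapshot; use the current one at eval time
    messages=[{"role": "user", "content": "Answer the benchmark question..."}],
    temperature=0,  # greedy decoding for reproducible reference results
)
print(resp.choices[0].message.content)
```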
Update 2025-02-01 (PST) (AI summary of creator comment):
- Resolution: YES based on Gemini 1.5 Pro 002 vs GPT-4o
- Benchmarks: Google Developers Blog
Resolution: YES based on Gemini 1.5 Pro 002 vs GPT-4o on these benchmarks: https://developers.googleblog.com/en/updated-gemini-models-reduced-15-pro-pricing-increased-rate-limits-and-more/
@hyperion "as soon as we get an accurate evaluation of Gemini" is probably closer to this release. https://blog.google/technology/ai/google-gemini-ai/#performance
The result is the same, so it doesn't matter anyway.
@MikhailDoroshenko Ok, maybe not; I remember there being a lot of arguments about CoT@32 vs 5-shot. I don't have any stakes here, so it doesn't matter much to me, but it makes the resolution debatable imo.
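For context on the CoT@32 vs 5-shot dispute: Gemini's headline MMLU score was reported with 32 chain-of-thought samples aggregated by consensus, while GPT-4's standard figure is 5-shot with a single answer. Below is a rough sketch of the two scoring procedures; the `model` callable is a hypothetical stub, and the real Gemini setup used uncertainty-routed voting, which this simplifies.

```python
# Rough sketch of why CoT@32 and 5-shot numbers aren't directly comparable.
# `model(prompt)` is a hypothetical stub returning one sampled completion.
from collections import Counter

def score_5_shot(model, question: str, few_shot_prefix: str) -> str:
    # Single answer conditioned on five worked examples.
    return model(few_shot_prefix + question)

def score_cot_at_32(model, question: str) -> str:
    # 32 sampled chain-of-thought completions; the majority answer wins,
    # which typically lifts accuracy relative to a single sample.
    answers = [model("Think step by step.\n" + question) for _ in range(32)]
    return Counter(answers).most_common(1)[0][0]
```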