Scores 73+ on the Artificial Analysis Intelligence Index
40+% on HLE
75+% on SWE-bench (www.swebench.com)
Resolves to the number of these criteria that the best Gemini 3 model achieves at launch. If one of them is not known by the time /Balasar/gemini-3-exceeds-expectations resolves, I will count it as not having happened.
Update 2025-11-18 (PST) (AI summary of creator comment): HLE (Humanity's Last Exam) will be resolved based on the Scale.com leaderboard: https://scale.com/leaderboard/humanitys_last_exam
Update 2025-12-20 (PST) (AI summary of creator comment): When determining which Gemini 3 model's scores to count, "best" refers to the overall model, not benchmark-by-benchmark scores. If a non-"best" model (like Flash) achieves higher scores on individual benchmarks, those scores will likely not be counted unless that model achieves more criteria overall by resolution time.
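As a rough illustration of the counting rule described above, here is a minimal sketch. It assumes that "73+" and "N+%" mean "greater than or equal to", that an unknown score is represented as `None`, and that all scores come from the single model treated as "best". The key names and the example scores are hypothetical and are not real benchmark results.

```python
from typing import Dict, Optional

# Thresholds from the market description. Reading "73+" as ">= 73" is an assumption.
THRESHOLDS: Dict[str, float] = {
    "aa_intelligence_index": 73.0,  # Artificial Analysis Intelligence Index
    "hle_percent": 40.0,            # Humanity's Last Exam, percent
    "swe_bench_percent": 75.0,      # SWE-bench, percent
}


def count_criteria_met(scores: Dict[str, Optional[float]]) -> int:
    """Count how many thresholds one model meets.

    A score that is missing or None is treated as "not known" and
    therefore counted as not having happened.
    """
    met = 0
    for name, threshold in THRESHOLDS.items():
        value = scores.get(name)
        if value is not None and value >= threshold:
            met += 1
    return met


# Hypothetical scores for illustration only (not real results).
example_scores: Dict[str, Optional[float]] = {
    "aa_intelligence_index": 73.0,  # meets the threshold exactly
    "hle_percent": None,            # unknown at resolution time
    "swe_bench_percent": 74.9,      # just below the threshold
}

if __name__ == "__main__":
    print(count_criteria_met(example_scores))
```

Under these assumptions the market would resolve to the returned count for whichever model is treated as "best"; scores from different models are not mixed benchmark by benchmark.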
@robert Flash was released a month later. It's not considered the best model, but apparently it did score above 75% on SWE-bench. So how would this item resolve?
@adonisds "Best" refers to the model overall, rather than to benchmark-by-benchmark scores. If Flash achieves more criteria by the time the "exceeds expectations" market resolves, I guess I would count all of Flash's results. But that seems unlikely.
TL;DR: probably doesn't count
@Fynn I will resolve based on https://scale.com/leaderboard/humanitys_last_exam
Not sure if that is with or without tools