
At the end of 2026, there will be a model that performs best on GPQA Diamond. There will also be an open-weights model that performs best on GPQA Diamond.
The question resolves positively if and only if the score of the best open-weights model on 0-shot CoT GPQA is at most 7 percentage points below the score of the best-performing model on 0-shot CoT GPQA.
As of the time of writing, the model that performs best on GPQA Diamond is Claude Sonnet 3.5, with a score of 59.4. The best-performing open-weights model is Llama 3.1-405B, with a score of 51.1. This would not be sufficient for a positive resolution, as the gap is 8.3 percentage points. If the gap is exactly 7 points, the question still resolves positively, but if it is 7.1 points, it resolves negatively. The question also resolves positively if open-weights models are at the frontier on GPQA (i.e. if they beat closed-weights models).
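For illustration, the resolution rule above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not an official resolver: the function name `resolves_yes`, the `threshold` parameter, and the small floating-point tolerance are all hypothetical choices, and the gap is read as a difference in percentage points, as in the 59.4 vs. 51.1 example.

```python
def resolves_yes(best_overall: float, best_open: float, threshold: float = 7.0) -> bool:
    """Return True if the best open-weights score is within `threshold`
    percentage points of the best overall score (hypothetical helper)."""
    # Gap in percentage points; negative if the open-weights model leads.
    gap = best_overall - best_open
    # Tiny tolerance (an assumption) so that a gap of exactly 7.0 is not
    # pushed over the line by floating-point rounding.
    return gap <= threshold + 1e-9


# Scores quoted in the description: the gap is 8.3 points.
print(resolves_yes(59.4, 51.1))  # False
```

A negative gap (an open-weights model at the frontier) satisfies the condition automatically, which matches the last sentence of the description.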
@manic_pixie_agi @ArielG @PhilosophyBear @acertain @DimlakGorkehgz would you be against changing the benchmark to GPQA Diamond?
Interesting question! I wouldn't be surprised if the benchmark saturates by the end of 2026.
If GPQA saturates by 2026, then this question would almost always resolve yes. It might be worth asking the question in a more general way, like "will an open-weights model get within 7% of average performance on leading benchmarks in 2026?"
Makes sense, I guess I wanted to have something concrete, still being influenced by the Metaculus question formulation.
I don't like saying "leading benchmark" because that's quite ambiguous. SWE-bench seems difficult for anything at the moment, but it's not limited to single models.