Will the gap between open-weights and frontier models on GPQA be at most 7%?

At the end of 2026, there will be a model that performs best on GPQA. There will also be an open-weights model that performs best on GPQA.

The question resolves positively if and only if the score of the best open-weights model on 0-shot CoT GPQA is at most 7 percentage points below the score of the best-performing model on 0-shot CoT GPQA.

As of the time of writing, the model that performs best on GPQA is Claude 3.5 Sonnet, with a score of 59.4. The best-performing open-weights model is Llama 3.1-405B, with a score of 51.1. This would not be sufficient for a positive resolution, as the gap is 8.3 percentage points. If the gap is exactly 7 points, the question still resolves positively, but if it is 7.1 points, it resolves negatively. The question also resolves positively if open-weights models are at the frontier on GPQA (i.e. if they beat closed-weights models).
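For anyone who wants to check the rule against new scores, here is a minimal sketch of the resolution logic as I read it. The function names are made up for illustration, and rounding the gap to one decimal place is my assumption (the scores above are quoted to one decimal); the market creator has the final say on edge cases.

```python
def gap_points(best_closed: float, best_open: float) -> float:
    """Gap in percentage points between the best overall model and the best open-weights model."""
    # If an open-weights model is at the frontier, the best overall score is its own score.
    best_overall = max(best_closed, best_open)
    # Assumption: scores are reported to one decimal, so round away float noise.
    return round(best_overall - best_open, 1)


def resolves_yes(best_closed: float, best_open: float, threshold: float = 7.0) -> bool:
    """True if the gap is at most `threshold` percentage points."""
    return gap_points(best_closed, best_open) <= threshold


# Scores at the time of writing: Claude 3.5 Sonnet 59.4, Llama 3.1-405B 51.1
print(gap_points(59.4, 51.1))    # 8.3
print(resolves_yes(59.4, 51.1))  # False
```

A gap of exactly 7.0 returns True, 7.1 returns False, and any case where the open-weights score is the higher one gives a gap of 0.0 and returns True.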

bought Ṁ25 YES

Interesting question! I won't be surprised if the benchmark saturates by the end of 2026.

Maybe worth a question as well 🤔

If GPQA saturates by 2026, then this question would almost always resolve yes. It might be worth asking the question in a more general way, like "will an open-weights model get within 7% of the average performance on leading benchmarks in 2026?"

Makes sense, I guess I wanted to have something concrete, still being influenced by the Metaculus question formulation.

I don't like saying "leading benchmark" because that's quite ambiguous. SWE-bench seems difficult for anything at the moment, but it's not limited to single models.