Will the gap between open-weights and frontier models on GPQA Diamond be at most 7%?

Jan 1

49% chance

At the end of 2026, there will be a model that performs best on GPQA Diamond. There will also be an open-weights model that performs best on GPQA Diamond.

The question resolves positively if and only if the score of the best open-weights model on 0-shot CoT GPQA is at most 7 percentage points below the score of the best-performing model on 0-shot CoT GPQA.

As of the time of writing, the model that performs best on GPQA Diamond is Claude 3.5 Sonnet, with a score of 59.4%. The best-performing open-weights model is Llama 3.1-405B, with a score of 51.1%. This would not be sufficient for a positive resolution, as the gap is 8.3 percentage points. If the gap is exactly 7 points, the question still resolves positively, but if it is 7.1 points, it resolves negatively. The question also resolves positively if open-weights models are at the frontier on GPQA (i.e. if they beat closed-weights models).
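The resolution rule above can be sketched in a few lines of Python. This is only an illustration, not an official resolution script: the function name `resolves_yes` and its parameters are made up here, and rounding the gap to one decimal place is an assumption based on the scores being quoted to one decimal.

```python
def resolves_yes(best_overall: float, best_open: float, max_gap: float = 7.0) -> bool:
    """Return True if the market would resolve YES under the stated rule.

    Scores are GPQA accuracies in percent (e.g. 59.4). The gap is measured
    in percentage points. Rounding to one decimal is an assumption, made so
    that floating-point noise does not turn 7.0 into 7.000000001.
    """
    gap = round(best_overall - best_open, 1)
    # A zero or negative gap means an open-weights model is at the frontier,
    # which also counts as YES.
    return gap <= max_gap


# Scores quoted in the question description: gap of 8.3 points, so NO.
print(resolves_yes(59.4, 51.1))
```

With the numbers from the description the function returns `False`; a gap of exactly 7.0 points returns `True` and a gap of 7.1 points returns `False`.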

Currently 4.2% difference on GPQA diamond.

@manic_pixie_agi @ArielG @PhilosophyBear @acertain @DimlakGorkehgz would you be against changing the benchmark to GPQA diamond?

@NiplavYushtun no objections from me

@NiplavYushtun Currently 11.1% (between Grok 4 and Kimi 2), via https://artificialanalysis.ai/evaluations/gpqa-diamond.

Unfortunately it does look like GPQA diamond is saturating. Another market using performance on Nethack?

Interesting question! I won't be surprised if the benchmark saturates by the end of 2026.

Maybe worth a question as well 🤔

If GPQA saturates by 2026, then this question would almost always resolve yes. It might be worth asking the question in a more general way, like "will an open-weights model get within 7% of the average performance on leading benchmarks in 2026?"

Makes sense, I guess I wanted to have something concrete, still being influenced by the Metaculus question formulation.

I don't like saying "leading benchmark" because that's quite ambiguous. SWE-bench seems difficult for anything at the moment, but it's not limited to single models.