Will the gap between open-weights and frontier models on GPQA Diamond be at most 7%?

Dec 31

At the end of 2026, there will be a model that performs best on GPQA Diamond. There will also be an open-weights model that performs best on GPQA Diamond.

Question resolves positively if and only if the score of the best open-weights model on 0-shot CoT GPQA is at most 7% less than the score of the best-performing model on 0-shot CoT GPQA.

As of the time of writing, the best-performing model on GPQA Diamond is Claude Sonnet 3.5, with a score of 59.4%. The best-performing open-weights model is Llama 3.1-405B, with a score of 51.1%. This would not be sufficient for a positive resolution, as the gap is 8.3 percentage points. If the gap is exactly 7 percentage points, the question still resolves positively, but if it is 7.1, it resolves negatively. The question also resolves positively if open-weights models are at the frontier on GPQA (i.e. if they beat closed-weights models).
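The resolution rule can be sketched in a few lines of Python. This is only an illustration, not an official resolution script: the function names `gap_points` and `resolves_yes` are hypothetical, and it assumes the gap is the absolute difference between the two scores (which is how the 59.4 vs. 51.1 example arrives at 8.3), not a relative difference. Scores are passed as strings and handled with `Decimal` so that a gap of exactly 7 is not distorted by floating-point rounding.

```python
from decimal import Decimal

# Maximum allowed gap for a positive resolution (assumed to be an absolute difference).
THRESHOLD = Decimal("7")


def gap_points(best_overall: str, best_open_weights: str) -> Decimal:
    """Return the best overall score minus the best open-weights score."""
    return Decimal(best_overall) - Decimal(best_open_weights)


def resolves_yes(best_overall: str, best_open_weights: str) -> bool:
    """True if the open-weights model is at most 7 points behind.

    A zero or negative gap (open weights at or ahead of the frontier)
    also counts as a positive resolution.
    """
    return gap_points(best_overall, best_open_weights) <= THRESHOLD


if __name__ == "__main__":
    # Scores quoted in the market description.
    print(gap_points("59.4", "51.1"))
    print(resolves_yes("59.4", "51.1"))
```

With the scores from the description, the gap is 8.3 and the check returns `False`. A gap of exactly 7 returns `True`, and 7.1 returns `False`.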

bought Ṁ200 NO

Top performing model is Gemini 3.0 Pro at 90.3% via https://artificialanalysis.ai/evaluations/gpqa-diamond, top performing open-weights model is GLM-4.7 at 85.9%, available on HuggingFace.

Currently 4.2% difference on GPQA diamond.

@manic_pixie_agi @ArielG @PhilosophyBear @acertain @DimlakGorkehgz would you be against changing the benchmark to GPQA diamond?

@NiplavYushtun no objections from me

@NiplavYushtun Currently 11.1% (between Grok 4 and Kimi 2), via https://artificialanalysis.ai/evaluations/gpqa-diamond.

Unfortunately, it does look like GPQA Diamond is saturating. Perhaps another market using performance on NetHack?

bought Ṁ25 YES

Interesting question! I won't be surprised if the benchmark saturates by the end of 2026.

Maybe worth a question as well 🤔

If GPQA saturates by 2026, then this question would almost always resolve yes. It might be worth asking the question in a more general way, like "Will an open-weights model get within 7% of the average performance on leading benchmarks in 2026?"

Makes sense, I guess I wanted to have something concrete, still being influenced by the Metaculus question formulation.

I don't like saying "leading benchmark" because that's quite ambiguous. SWE-bench seems difficult for anything at the moment, but it's not limited to single models.