Will the highest-scoring LLM on Dec 31, 2026 show <10% improvement over 2025's best average benchmark performance?

BASELINE (2025 leader): As of Aug 2025, the apparent leader is GPT-4.5 at ~90.2% on MMLU, with Claude 4 and Gemini 2.5 Pro at ~85-86%.

To establish the 2025 baseline:

  • On Dec 31, 2025, identify the LLM with the highest average score across the "Core Benchmark Suite" (defined below; a minimal averaging sketch follows the suite list)

  • This becomes the baseline for calculating 10% improvement

CORE BENCHMARK SUITE (to avoid cherry-picking):

  1. MMLU (general knowledge)

  2. HumanEval (coding)

  3. GSM8K (math reasoning)

  4. ARC-Challenge (scientific reasoning)

  5. GPQA (expert-level knowledge)
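
To make the averaging rule concrete, here is a minimal sketch. The benchmark names are from the suite above; the scores are hypothetical placeholders for illustration, not real results:

```python
# Hypothetical 2025-leader scores across the Core Benchmark Suite
# (placeholder numbers, not actual reported results).
baseline_scores = {
    "MMLU": 90.2,
    "HumanEval": 92.0,
    "GSM8K": 96.5,
    "ARC-Challenge": 96.0,
    "GPQA": 60.0,
}

# "Average" is defined below as the simple arithmetic mean.
baseline_avg = sum(baseline_scores.values()) / len(baseline_scores)
print(f"2025 baseline average: {baseline_avg:.1f}%")  # -> 86.9%
```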

RESOLUTION CRITERIA:

  • On Dec 31, 2026, identify the highest-scoring LLM on the same benchmark suite

  • Calculate the percentage improvement: (2026_score - 2025_score) / 2025_score × 100 (a resolution sketch follows this list)

  • BET RESOLVES YES if improvement is less than 10%

  • BET RESOLVES NO if improvement is 10% or greater
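
A minimal resolution sketch under the rule above; the function name and inputs are hypothetical, and it assumes the two suite averages have already been computed:

```python
def resolve(avg_2025: float, avg_2026: float) -> str:
    """Resolution rule: YES if improvement < 10%, NO otherwise."""
    improvement = (avg_2026 - avg_2025) / avg_2025 * 100
    return "YES" if improvement < 10 else "NO"

# Worked example from the description below: 85% -> 92%.
print(resolve(85.0, 92.0))  # improvement ~8.2% -> "YES"
```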

KEY DEFINITIONS:

  • "LLM": Text-based language models (excludes multimodal-only systems)

  • "Publicly available": Model must be accessible via API, open-source, or major consumer platform

  • "Score sources": Use official leaderboards (HuggingFace, Papers with Code) or company-reported figures

  • "Average": Simple arithmetic mean across the 5 benchmarks

EDGE CASES:

  • If benchmarks become saturated (>98% scores), substitute with the most widely-adopted replacement benchmark

  • If a benchmark is discontinued, use the closest equivalent as determined by academic consensus

  • Minimum of 3 valid benchmark scores required for inclusion (see the sketch after this list)
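
A sketch of the minimum-3 validity check; representing a missing or discontinued benchmark as None is an assumption for illustration:

```python
from statistics import mean

def suite_average(scores: dict[str, float | None]) -> float | None:
    """Arithmetic mean over valid scores; None if fewer than 3 are valid."""
    valid = [s for s in scores.values() if s is not None]
    return mean(valid) if len(valid) >= 3 else None

# Hypothetical model missing two benchmarks: still eligible (3 valid scores).
print(suite_average({
    "MMLU": 91.0, "HumanEval": None, "GSM8K": 97.0,
    "ARC-Challenge": None, "GPQA": 62.0,
}))  # -> 83.33...
```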

Example calculation:

  • 2025 leader: 85% average

  • 2026 leader: 92% average

  • Improvement: (92 - 85) / 85 × 100 ≈ 8.2% → YES (less than 10%)

Comment (bought Ṁ10 YES): These benchmarks are saturated or close enough
