BASELINE (2025 leader): As of Aug 2025, the apparent leader is GPT-4.5 at ~90.2% on MMLU, with Claude 4 and Gemini 2.5 Pro at ~85-86%
To establish the 2025 baseline:
On Dec 31, 2025, identify the LLM with the highest average score across the "Core Benchmark Suite" (defined below)
That model's suite average becomes the baseline score for calculating the 10% improvement
CORE BENCHMARK SUITE (to avoid cherry-picking; an averaging sketch follows the list):
MMLU (general knowledge)
HumanEval (coding)
GSM8K (math reasoning)
ARC-Challenge (scientific reasoning)
GPQA (expert-level knowledge)
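For concreteness, a minimal Python sketch of the averaging step, assuming all scores are reported as percentages; the model scores below are hypothetical placeholders, not reported results.

```python
# Minimal sketch: simple arithmetic mean across the Core Benchmark Suite.
CORE_SUITE = ["MMLU", "HumanEval", "GSM8K", "ARC-Challenge", "GPQA"]

def suite_average(scores: dict) -> float:
    """Simple arithmetic mean across the five core benchmarks (percent)."""
    return sum(scores[b] for b in CORE_SUITE) / len(CORE_SUITE)

# Hypothetical scores, for illustration only.
example = {"MMLU": 88.0, "HumanEval": 90.0, "GSM8K": 95.0,
           "ARC-Challenge": 93.0, "GPQA": 59.0}
print(suite_average(example))  # 85.0
```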
RESOLUTION CRITERIA:
On Dec 31, 2026, identify the highest-scoring LLM on the same benchmark suite
Calculate the percentage improvement: (2026_score - 2025_score) / 2025_score × 100 (see the sketch after this list)
BET RESOLVES YES if improvement is less than 10%
BET RESOLVES NO if improvement is 10% or greater
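A minimal sketch of the resolution arithmetic, assuming both leaders' suite averages are expressed as percentages; the function names are illustrative only.

```python
def improvement_pct(avg_2025: float, avg_2026: float) -> float:
    """Percentage improvement of the 2026 leader over the 2025 baseline."""
    return (avg_2026 - avg_2025) / avg_2025 * 100

def resolve(avg_2025: float, avg_2026: float) -> str:
    """YES if improvement is less than 10%, NO if 10% or greater."""
    return "YES" if improvement_pct(avg_2025, avg_2026) < 10 else "NO"
```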
KEY DEFINITIONS:
"LLM": Text-based language models (excludes multimodal-only systems)
"Publicly available": Model must be accessible via API, open-source, or major consumer platform
"Score sources": Use official leaderboards (HuggingFace, Papers with Code) or company-reported figures
"Average": Simple arithmetic mean across the 5 benchmarks
EDGE CASES:
If a benchmark becomes saturated (>98% scores), substitute the most widely adopted replacement benchmark
If a benchmark is discontinued, use the closest equivalent as determined by academic consensus
Minimum of 3 valid benchmark scores required for inclusion (see the sketch after this list)
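One possible reading of the minimum-score rule in code, assuming a model's average falls back to the mean of whatever valid scores remain; that fallback is an interpretation, not part of the bet's text.

```python
from statistics import mean

def suite_average_with_gaps(scores: dict) -> float:
    """Mean over available benchmark scores; models with fewer than
    3 valid scores are excluded (fallback behavior is assumed)."""
    valid = [s for s in scores.values() if s is not None]
    if len(valid) < 3:
        raise ValueError("excluded: fewer than 3 valid benchmark scores")
    return mean(valid)
```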
Example calculation (also run as code below):
2025 leader: 85% average
2026 leader: 92% average
Improvement: (92 - 85) / 85 × 100 ≈ 8.2% → YES (less than 10%)
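The same worked example as a self-contained check, using the hypothetical averages above:

```python
# Reproducing the worked example (hypothetical averages).
avg_2025, avg_2026 = 85.0, 92.0
pct = (avg_2026 - avg_2025) / avg_2025 * 100
print(f"{pct:.1f}% -> {'YES' if pct < 10 else 'NO'}")  # 8.2% -> YES
```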