Resolution criteria
This market resolves based on which company's AI model is ranked as the best overall at the end of January 2026 according to the most widely-cited independent benchmarking sources. Primary sources for resolution include:
LM Council's benchmark comparison tool, which aggregates 20+ benchmarks including Humanity's Last Exam, FrontierMath, GPQA, and SWE-bench
Artificial Analysis Intelligence Index v4.0, which weights four equal pillars: Agents, Coding, Scientific, and General
LMArena leaderboard (formerly Chatbot Arena), which uses blinded head-to-head battles where humans vote on the better answer
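To make these two aggregation methods concrete, here is a minimal sketch in Python using invented scores and votes rather than real leaderboard data. The function names and example figures are hypothetical. Note that LMArena's published methodology fits a Bradley-Terry model with confidence intervals; the simple Elo-style update below only illustrates the same ranking idea.

```python
from collections import defaultdict

def intelligence_index(pillars: dict[str, float]) -> float:
    """Equally weighted average of the four pillar scores
    (Agents, Coding, Scientific, General), mirroring the equal
    weighting described for the Intelligence Index v4.0."""
    assert set(pillars) == {"Agents", "Coding", "Scientific", "General"}
    return sum(pillars.values()) / len(pillars)

def elo_ratings(votes, k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """Illustrative Elo-style ratings from head-to-head votes.

    `votes` is an iterable of (winner, loser) pairs, one per
    blinded battle. This is a simplification of Arena-style
    ranking, not LMArena's actual pipeline."""
    ratings: dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected win probability for the current winner, then a zero-sum update.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

# Hypothetical example data, for illustration only.
print(intelligence_index({"Agents": 71.0, "Coding": 68.5, "Scientific": 74.2, "General": 70.3}))
print(elo_ratings([("Model A", "Model B"), ("Model A", "Model C"), ("Model B", "Model C")]))
```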
If sources conflict on which model is "best," the market resolves to whichever company's model appears most frequently at the top of these three leaderboards. The resolution will be determined by the state of these leaderboards on January 31, 2026.
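As an illustration of this tie-break rule, the sketch below counts which company's model appears most often at the top of the three sources on the resolution date. The standings, function names, and tie handling shown are hypothetical; the actual resolution follows the criteria stated above.

```python
from collections import Counter

def resolve_market(top_models: dict[str, str], model_to_company: dict[str, str]) -> str:
    """Resolve to the company whose model most frequently tops the
    three leaderboards; `top_models` maps each source to the model
    ranked #1 there on January 31, 2026."""
    counts = Counter(model_to_company[model] for model in top_models.values())
    (winner, wins), *rest = counts.most_common()
    if rest and rest[0][1] == wins:
        # The market text does not specify an exact-tie procedure; flag it here.
        raise ValueError("No single company leads a plurality of sources")
    return winner

# Hypothetical standings, for illustration only.
tops = {
    "LM Council": "GPT-5.2",
    "Artificial Analysis Intelligence Index v4.0": "GPT-5.2",
    "LMArena": "Gemini 3 Pro",
}
companies = {"GPT-5.2": "OpenAI", "Gemini 3 Pro": "Google", "Claude Opus 4.5": "Anthropic"}
print(resolve_market(tops, companies))  # -> "OpenAI" in this made-up example
```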
Background
As of January 2026, the battle for the top spot has intensified with major updates to the LMArena leaderboard and the release of the Artificial Analysis Intelligence Index v4.0. On LMArena's Text leaderboard, Gemini 3 Pro leads user-preference rankings, while the new Intelligence Index reports GPT-5.2 (with extended reasoning) as the top overall benchmark performer.
Google's Gemini 3 Pro is consolidating its lead in overall rankings, while Anthropic's Claude Opus 4.5 and OpenAI's GPT-5.2 are competing closely on coding and pure reasoning. Meanwhile, China's DeepSeek V3.2 is reshaping the economics of the field with dramatically lower costs.
Performance varies by task: Claude Opus 4.5 still holds the top score on SWE-bench Verified at 80.9%, though early results may shift, and GPT-5.2's 80.0% narrows what had been a wider gap. GPT-5.2's most striking claim is its performance on ARC-AGI-2, a benchmark designed to test genuine reasoning ability while resisting memorization: at 52.9% (Thinking) and 54.2% (Pro), OpenAI's new model substantially outscores both Claude Opus 4.5 (37.6%) and Gemini 3 Deep Think (45.1%).
Considerations
The data for January 2026 is clear: specialization has arrived. No single model wins every category. Different benchmarks measure different capabilities—user preference, coding ability, mathematical reasoning, and general knowledge work—so the "best" model depends on which metrics are weighted most heavily. The market resolves based on aggregate rankings across multiple independent sources rather than any single benchmark.