Background
PutnamBench is a multilingual benchmark that translates the famously difficult William Lowell Putnam Mathematical Competition into formal theorem-proving tasks: its current release contains 657 Lean problems, 640 Isabelle/HOL problems, and 412 Coq problems. The original Putnam contest, two three-hour sessions of six proof-based problems each for a maximum of 120 points, has a typical median score of just 0 or 1 among the roughly 4,000 elite undergraduates who sit it each year, underscoring the conceptual creativity required. By encoding every problem in three major proof assistants, PutnamBench measures an AI's ability not merely to produce correct final answers but to generate machine-checkable proofs across multiple formal languages and undergraduate fields such as algebra, combinatorics, analysis, and number theory.
State of Play
As of July 2025, the PutnamBench leaderboard highlights just how far current automated theorem provers still have to go. On the Lean 4 track (657 problems), DeepSeek-Prover-V2 tops the board with 47 proofs, or 7.2% coverage; on the Isabelle/HOL track (640 problems), DSP, a GPT-4o-based pipeline, has verified only 4 theorems (0.6%); and on the Coq track (412 problems), the best GPT-4o-based attempt has managed a single proof (0.2%).
Why this milestone matters
Undergraduate-level proof mastery. The Putnam exam is so difficult that the median human score is typically 0 or 1 out of 120 despite attracting top mathematics majors.
Cross-formalism generalisation. Success in three very different proof assistants (Lean, Isabelle, Coq) would show language-agnostic reasoning progress.
Giant performance leap required. The current state of the art, DeepSeek-Prover-V2's 47/657 Lean proofs and DSP's 4 Isabelle proofs, is still below 8% on the easiest track and below 1% on the others.
Research head-room. Bridging the gap to roughly 95% coverage requires about a 13× increase in Lean solves, roughly 150× in Isabelle, and nearly 400× in Coq, far beyond what pass@k sampling tricks can deliver.
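A quick sanity check of those multipliers, using the July 2025 leaderboard numbers quoted above and the per-track thresholds from the Resolution Criteria below (the script and its variable names are purely illustrative):

```python
# Illustrative sanity check of the multipliers quoted above, using the
# July 2025 leaderboard numbers and the per-track thresholds from the
# Resolution Criteria (all names here are purely illustrative).
tracks = {
    # track: (current best solves, total problems, threshold in the criteria)
    "Lean 4":       (47, 657, 624),
    "Isabelle/HOL": (4,  640, 608),
    "Coq":          (1,  412, 392),
}

for name, (solved, total, target) in tracks.items():
    coverage = solved / total          # current fraction of the track solved
    multiplier = target / solved       # factor of improvement still needed
    print(f"{name:13s} {solved:>3}/{total} = {coverage:5.1%}  "
          f"-> needs ~{multiplier:.0f}x more solves to reach {target}/{total}")
```

Running this prints roughly 13× for Lean, 152× for Isabelle, and 392× for Coq.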
Resolution Criteria
This market resolves to the first year in which ALL conditions below are met; otherwise it stays unresolved:
Leaderboard evidence – The official PutnamBench leaderboard shows a single AI model achieving all of the following scores across the three tracks:
Lean ≥ 624/657, Isabelle ≥ 608/640, Coq ≥ 392/412 (roughly 95% of each track; see the illustrative check after this list).
Public verification – The run is corroborated by either
(a) a paper describing the method (an arXiv preprint or a peer-reviewed venue such as NeurIPS or ICLR), or
(b) a public repository accepted by the PutnamBench maintainers.
No human help after launch – Human engineers may design the system, but all proofs must be generated autonomously during evaluation; unlimited compute, external tools, and retrieval are allowed, provided they are invoked by the agent itself.
Expiry – If no qualifying run is verified by Jan 1, 2041, the market resolves “Not Applicable.”
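For concreteness, here is a minimal sketch (not part of the resolution criteria themselves) of how the leaderboard condition could be checked programmatically, assuming hypothetical per-model scores; only the threshold numbers come from the criteria above:

```python
# Minimal sketch of condition 1: a single model must meet or exceed every
# per-track threshold. Thresholds come from the criteria above; the example
# scores and the function name are hypothetical.
THRESHOLDS = {"Lean": (624, 657), "Isabelle": (608, 640), "Coq": (392, 412)}

def meets_leaderboard_condition(scores: dict[str, int]) -> bool:
    """scores maps track name -> problems proved by one model on that track."""
    return all(scores.get(track, 0) >= required
               for track, (required, _total) in THRESHOLDS.items())

# Hypothetical runs: the first clears every threshold, the second misses Isabelle.
print(meets_leaderboard_condition({"Lean": 630, "Isabelle": 610, "Coq": 400}))  # True
print(meets_leaderboard_condition({"Lean": 650, "Isabelle": 500, "Coq": 412}))  # False
```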