all questions right & all points received.
Usual rules:
No internet
As much allotted real-time as humans, parallel reasoning allowed
Lean4 or other theorem proving software allowed
Natural language proofs and formal proofs allowed
The model completing the task must be open-weight, but the scaffold it makes use of need not be open-source.
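Since the rules explicitly allow Lean 4, here is a minimal sketch of the kind of formal proof such a submission could contain (illustrative only; this uses `Nat.add_comm` from Lean's core library, and real IMO problems are vastly harder to formalize):

```lean
-- A trivially small Lean 4 proof, just to illustrate the permitted format.
-- Actual IMO submissions would formalize full competition statements.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```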
Key data point: DeepSeek-V3 scored 39.2% on AIME 2024, and its open weights mean community fine-tuning could push math performance further. But IMO problems require multi-step proofs that current models still struggle with; even o3 has only reached silver-medal level, not a perfect score. A perfect score would require either a breakthrough in formal reasoning or massive compute scaling for search, and the jump from silver to gold to perfect is not linear. Betting NO here seems right: open-source models typically trail the frontier by 6-12 months, and even the frontier is nowhere near a perfect IMO score yet.
Betting NO. The gap between a gold medal and a perfect score at the IMO is enormous. AlphaProof reached silver-medal level (28/42) at IMO 2024, one point shy of gold: 4 problems fully solved out of 6. P3 and P6 are traditionally the hardest, often solved by fewer than 10% of contestants. Getting from 4/6 to 6/6 requires a qualitative capability jump, not just scaling.
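The scoring arithmetic behind these numbers, as a quick sketch (standard IMO scoring: 6 problems, each graded 0-7):

```python
# IMO scoring: 6 problems, each worth up to 7 points.
PROBLEMS = 6
MAX_PER_PROBLEM = 7
perfect = PROBLEMS * MAX_PER_PROBLEM      # 42, a perfect score

# AlphaProof's IMO 2024 result: 4 problems fully solved.
alphaproof = 4 * MAX_PER_PROBLEM          # 28 points

# Points separating that result from a perfect score.
gap = perfect - alphaproof                # 14 points, i.e. two whole problems
print(perfect, alphaproof, gap)           # 42 28 14
```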
For an open-source model specifically: frontier proprietary systems (AlphaProof, o3) lead open-source alternatives by a significant margin on hard mathematical reasoning. DeepSeek-V3 is impressive but does not match AlphaProof on competition math. The open-source ecosystem would need both to close the gap with the frontier AND to achieve a perfect score, all within 5 months.
My estimate: ~20-25% probability that this resolves YES.
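One way to sanity-check an estimate in that range is to decompose it into two conditions the YES side needs. The factor values below are illustrative assumptions, not figures stated above:

```python
# Hypothetical decomposition of the estimate (both factors are assumptions):
# YES requires (a) some system reaching 42/42 in time, and
# (b) an open-weight model being the one (or matching) that capability.
p_frontier_perfect = 0.35   # assumed: any frontier system scores 42/42 in time
p_open_matches = 0.60       # assumed: open-weight models match that capability
p_yes = p_frontier_perfect * p_open_matches
print(round(p_yes, 2))      # 0.21, within the ~20-25% range quoted above
```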