The fifth question from this post: https://garymarcus.substack.com/p/dear-elon-musk-here-are-five-things
The full text is: "In 2029, AI will not be able to take arbitrary proofs from the mathematical literature written in natural language and convert them into a symbolic form suitable for symbolic verification."
Judgment will be by me, not Gary Marcus.
Ambiguous whether this means start or end of 2029, so I have set it for the end.
I will accept if there is an AI that can do this for >=75% of the proofs published on the arxiv math categories in 2029.
People are also trading
How is this not at 100%? The best mathematicians in the world are currently using commercially available LLMs to do exactly this. Here is a video of Terrence Tao doing this exact thing: https://www.youtube.com/watch?v=JHEO7cplfk8, here is another: https://xenaproject.wordpress.com/2025/12/05/formalization-of-erdos-problems/. Here is a paper about it: https://arxiv.org/pdf/2604.03071. This should be resolved today.
@JamesUul2R as far as I understand they can do it for fairly simple proofs and with a lot of handholding. Still quite a ways to go to before they can handle "an arbitrary proof in the literature"
@AhronMaline Human beings are not capable of translating every proof in the literature into Lean. For example, the ABC conjecture. If this is to be interpreted as requiring AI to literally convert every single proof from the literature into Lean then this is a super-intelligence test not a AGI test. I can see someone arguing that today they make enough errors to not meet the highest reasonable interpretation of this standard. But I do think if you had described today's capabilities to Marcus in 2022 he would have said they would pass the test. He probably wouldn't admit it today. But, by 2029? I don't think even he will be able to deny it.
@JamesUul2R since nobody outside a small group has managed to understand the alleged proof of ABC, I don't think that one counts as "in the literature". But in general I think it's true that only experts in the appropriate subfield can formalize a typical serious proof. So yeah, this benchmar is indeed very high. That's why it's "only" at 81%!
I'm somewhat confused why the recent news of partial automatization haven't moved this market.
https://x.com/mathematics_inc/status/1966194751847461309
@Aleph I also still endorse my comment there about there being notable paths to improving Lean (or Coq, etc.) code generation.
@TomBouley "I will accept if there is an AI that can do this for >=75% of the proofs published on the arxiv math categories in 2029."
@SG No, definitely not. I certainly couldn't. But Marcus isn't claiming to be capturing human capabilities exactly so I think this isn't "unfair", and it's an interesting question anyway