Definitions used:
Under "AI" I consider any fully autonomous computer program that is not given any human input other than the mathematic
Under "recent mathematical results" I consider novel mathematical results that have been published during or after 2010.
Under "fully formal proof" I consider a sequence of proof steps written in a formal mathematical logic such as CIC or ZFC.
Under "mainstream proof assistant" I consider languages such as Isabelle/HOL, Coq, or Lean.
This question will resolve to YES if such an AI can translate and verify at least five recent mathematical results that had not previously been verified in a proof assistant before the start of 2025, and will otherwise resolve to NO.
I feel like this one is a close call. Did someone try o1 pro and got more than a 10% failure rate? What about o3?
A lot of progress has been made on LLMs using Lean in the past 6 months.
@NathanHelmBurger I have not tried anything here, but I think that NO is almost certainly correct. To formalize a recent mathematical result in a given paper, I think one almost always has to formalize lots of background results in references. I haven't really seen anything that purports to be able to crawl through the tree of background results like that, successfully formalizing everything it needs.
@BoltonBailey More to the point, the criteria say "AI can translate and verify at least five recent mathematical results that had not previously been verified". Is there any specific claim about which five results these were that should make this market resolve YES?
https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ though will need more time to see how far this can get on this market.
Is one of the conditions on "recent mathematical result" that the result must be originally proven by a human? Or does something like this count? https://arxiv.org/abs/1605.00723
Under "AI" I consider any fully autonomous computer program that is not given any human input other than the mathematic
I think what you really want is a desiderata that they haven't guided it more than 'natural'. Like, they may have trained the model on a bunch of human written proofs -> formal language, but they haven't specifically given it this proof or any guidance specific to it.
-----
My view is that this could be done this year if someone tried, and that the main limitation is that (some organization) just hasn't strongly trained on a theorem prover yet. MetaAI did a bit of Lean stuff, but not a big model or anywhere near as generic and powerful as GPT.
Like, GPT-3 and friends knows Lean to a not terrible degree, but it is significantly more confused about Lean than it is JavaScript. This is primarily due to the (far lesser) amount of training data, but (some org) could hire a bunch of programmers
to write new Lean proofs
write new Lean proofs with extra finetuning from the human written versions
rewrite existing proofs in multiple ways to give it better generalization (and, the Lean community kinda likes golfing too much which obscures the meaning even for humans, and so probably makes this even more worthwhile)
Which would be good ways to figuratively level-up the understanding quite a bit.
Then there's the fact that Lean, and other proof assistants, have a context that you use to know what the current variables/proofs/etc are in the scope. Existing GPT and friends are writing Lean without the context, which would actually be very hard to do! Especially since various tactics would change what proof a hypothesis has, with that change only being visible from your knowledge of the tactic or through the context.
Essentially: A project could get far more competent Lean results by training with the context. This would also help guard against changes in tactic behavior (which do happen, and is a part of the confusion).
(The org could also do wackier things, like giving it the ability to query what types are available in a package -> filter them for relevant results to the problem so that it definitely fits in context -> apply them; this avoids hallucinating - which is a problem with Lean code gen currently - and makes it more dynamic)
(context bits)
Sagredo, a recently in-development Lean tactic, mixes GPT-4 with Lean 4 in a sort of dialogue. See: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/sagredo (there's a video there, but the video is unlisted so unsure if I'm supposed to link it directly)
Sagredo gives the user's theorem statement (hypothesis parameters, and goal) to GPT-4, telling it to continue the proof. Then, whenever it stops it gives GPT-4 the current context state or an error message if there is one and asks it to continue/fix-the-error.
While this isn't enough for the full market, it is a notable example that with a bit of infrastructure you can have the AI automatically proving things in a far better way with the context.
It does have issues because GPT has primarily seen Lean 3 code, not Lean 4 (which is relatively newer), and there was a variety of syntax changes + tactic changes. They typically automatically fix themselves, because they have an initial prompt that details the differences, but it does make the generation take longer.
Though I don't expect this to be an issue you can't get around by smart enough prompting. One method would be to have a program that autocorrects common mistakes, ex: writing rw a,
(Lean 3) instead of rw [a]
(Lean 4) so you don't have to do another prompting round. (At least I believe the API lets you alter the prompt, so then it never made the mistake at all).
It would be cool if OpenAI did deliberate training or finetuning of a model with Lean 4. I think you could probably get good results from ChatGPT 3.5 or even GPT 3 if it was just trained better on them.
Related paper https://leandojo.org
Extracts lean theorems into a dataset for theorem provers to use. Then they have a way to hook up different backends (ChatGPT and their ReProver model) to suggest tactics.
Has three videos of chatgpt which are interesting. Similar to Sagredo in some ways, I expect they could make it even more streamlined by just using the API directly.
Not major, but I do think it is a step in the right direction
@NiclasKupper Yes, it gets a human-written proof as input, for example a pdf of a published paper. It can make use of human-written libraries for results that are well-known in the field, but not for any results that the authors of the paper found relevant/novel enough to prove in the paper itself. (If the main result of the paper already has a human-written proof in mathlib, it is probably not a good candidate for resolution of this question.)