Will an AI be able to convert recent mathematical results into a fully formal proofs that can be verified by a mainstream proof assistant by 2025?

1kṀ8327

resolved Jan 2

Resolved

ALL

Definitions used:

Under "AI" I consider any fully autonomous computer program that is not given any human input other than the mathematic
Under "recent mathematical results" I consider novel mathematical results that have been published during or after 2010.
Under "fully formal proof" I consider a sequence of proof steps written in a formal mathematical logic such as CIC or ZFC.
Under "mainstream proof assistant" I consider languages such as Isabelle/HOL, Coq, or Lean.

This question will resolve to YES if such an AI can translate and verify at least five recent mathematical results that had not previously been verified in a proof assistant before the start of 2025, and will otherwise resolve to NO.

Math

Get

1,000

to start trading!

🏅 Top traders

#	Name	Total profit
1		Ṁ460
2		Ṁ264
3		Ṁ244
4		Ṁ173
5		Ṁ112

People are also trading

In 2029, will any AI be able to take an arbitrary proof in the mathematical literature and translate it into a form suitable for symbolic verification? (Gary Marcus benchmark #5)

80% chance

Will an AI solve any important mathematical conjecture before January 1st, 2030?

80% chance

Will the majority of mathematicians rely on formal computer proof assistants before the end of 2040?

60% chance

Will an AI model write the proof to the Riemann Hypothesis by the end of 2025?

2% chance

Will natural language based proof assistants be in common use by 2026?

23% chance

By 2030, AI can autonomously prove mathematical theorems that are publishable in mathematics journals today?

73% chance

Will aesop be able to replace >50% of mathlib proofs by 2025-11-26?

41% chance

Will an AI implement a full infinity-topos mathematical framework by end of 2025?

17% chance

Will AIs be widely recognized as having developed a new, innovative, foundational mathematical theory before 2035?

50% chance

Will AIs be widely recognized as having developed a new, innovative, foundational mathematical theory before 2030?

Sort by:

I feel like this one is a close call. Did someone try o1 pro and got more than a 10% failure rate? What about o3?

A lot of progress has been made on LLMs using Lean in the past 6 months.

@NathanHelmBurger I have not tried anything here, but I think that NO is almost certainly correct. To formalize a recent mathematical result in a given paper, I think one almost always has to formalize lots of background results in references. I haven't really seen anything that purports to be able to crawl through the tree of background results like that, successfully formalizing everything it needs.

@BoltonBailey More to the point, the criteria say "AI can translate and verify at least five recent mathematical results that had not previously been verified". Is there any specific claim about which five results these were that should make this market resolve YES?

bought Ṁ150 YES

https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/ though will need more time to see how far this can get on this market.

AI achieves silver-medal standard solving International Mathematical Olympiad problems

Breakthrough models AlphaProof and AlphaGeometry 2 solve advanced reasoning problems in mathematics

Is one of the conditions on "recent mathematical result" that the result must be originally proven by a human? Or does something like this count? https://arxiv.org/abs/1605.00723

Mathematical Genius

Numbers dance within my head,
Formulas swirl in a deathly bed,
Equations pulsate with life anew,
And yet I still can't divide by two.

Under "AI" I consider any fully autonomous computer program that is not given any human input other than the mathematic

I think what you really want is a desiderata that they haven't guided it more than 'natural'. Like, they may have trained the model on a bunch of human written proofs -> formal language, but they haven't specifically given it this proof or any guidance specific to it.

-----

My view is that this could be done this year if someone tried, and that the main limitation is that (some organization) just hasn't strongly trained on a theorem prover yet. MetaAI did a bit of Lean stuff, but not a big model or anywhere near as generic and powerful as GPT.

Like, GPT-3 and friends knows Lean to a not terrible degree, but it is significantly more confused about Lean than it is JavaScript. This is primarily due to the (far lesser) amount of training data, but (some org) could hire a bunch of programmers

to write new Lean proofs
write new Lean proofs with extra finetuning from the human written versions
rewrite existing proofs in multiple ways to give it better generalization (and, the Lean community kinda likes golfing too much which obscures the meaning even for humans, and so probably makes this even more worthwhile)

Which would be good ways to figuratively level-up the understanding quite a bit.

Then there's the fact that Lean, and other proof assistants, have a context that you use to know what the current variables/proofs/etc are in the scope. Existing GPT and friends are writing Lean without the context, which would actually be very hard to do! Especially since various tactics would change what proof a hypothesis has, with that change only being visible from your knowledge of the tactic or through the context.

Essentially: A project could get far more competent Lean results by training with the context. This would also help guard against changes in tactic behavior (which do happen, and is a part of the confusion).

(The org could also do wackier things, like giving it the ability to query what types are available in a package -> filter them for relevant results to the problem so that it definitely fits in context -> apply them; this avoids hallucinating - which is a problem with Lean code gen currently - and makes it more dynamic)

predictedYES

(context bits)

Sagredo, a recently in-development Lean tactic, mixes GPT-4 with Lean 4 in a sort of dialogue. See: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/sagredo (there's a video there, but the video is unlisted so unsure if I'm supposed to link it directly)

Sagredo gives the user's theorem statement (hypothesis parameters, and goal) to GPT-4, telling it to continue the proof. Then, whenever it stops it gives GPT-4 the current context state or an error message if there is one and asks it to continue/fix-the-error.

While this isn't enough for the full market, it is a notable example that with a bit of infrastructure you can have the AI automatically proving things in a far better way with the context.

It does have issues because GPT has primarily seen Lean 3 code, not Lean 4 (which is relatively newer), and there was a variety of syntax changes + tactic changes. They typically automatically fix themselves, because they have an initial prompt that details the differences, but it does make the generation take longer.

Though I don't expect this to be an issue you can't get around by smart enough prompting. One method would be to have a program that autocorrects common mistakes, ex: writing rw a, (Lean 3) instead of rw [a] (Lean 4) so you don't have to do another prompting round. (At least I believe the API lets you alter the prompt, so then it never made the mistake at all).

It would be cool if OpenAI did deliberate training or finetuning of a model with Lean 4. I think you could probably get good results from ChatGPT 3.5 or even GPT 3 if it was just trained better on them.

predictedYES