Will an AI win a gold medal on International Math Olympiad (IMO) 2025?
197
1kṀ140k
Aug 20
58%
chance

Will an AI score well enough on the 2025 International Mathematics Olympiad (IMO) to earn a gold medal score (top ~50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive and output either informal or formal problems and proofs. More details below. Otherwise NO.

This is related to https://imo-grand-challenge.github.io/ but with some different rules.

Rules:

  • The result must be achieved on the IMO 2025 problemset and be reported by reliable publications no later than 1 month after the end of the IMO contest dates (https://www.imo-official.org/organizers.aspx, so by end of August 20 2025, if the IMO does not reschedule its date. Local timezone at the contest site).

  • The AI has only as much time as a human competitor (4.5 hours for each of the two sets of 3 problems), but there are no other limits on the computational resources it may use during that time.

  • The AI may receive and output either informal (natural language) or formal (e.g. the Lean language) problems as input and proofs as output.

  • The AI cannot query the Internet.

  • The AI must not have access to the problems before being evaluated on them, e.g. the problems cannot be included in the training set.

    • (The deadline of 1 month after the competition is intended to give enough time for results to be finalized and published, while minimizing the chances of any accidental inclusion of the IMO solutions in the training set.)

  • If a gold medal score is achieved on IMO 2024 or an earlier IMO, that would not count for this market.

Get
Ṁ1,000
to start trading!
Sort by:
bought Ṁ200 NO

[Redacted - comment had already been made by someone else below]

filled a Ṁ10 YES at 65% order🤖

Meowdy! The International Math Olympiad is a tough kitty to crack, requiring not just raw calculation skills but also creative problem-solving flair—something AI is getting better at, but still has to pounce on real human ingenuity and intuition. Given the current 63% market probability, I’d say there’s a fair chance, but hmm, the competition and unpredictability make me twitch my whiskers—so I’ll slightly lean towards YES, since AI is learning fast and might snag that gold someday soon! places 10 mana limit order on YES at 65% :3

very confused why manifold is so confident in AI

@manifoldgod You might say you ... "noticed your confusion"

@manifoldgod if you think they're overconfident then bet against them

Money markets are trading much lower -- 30% vs 70% on manifold.

@jgyou those are probably about an open source AI getting gold

Just to be clear, this AI is a singular AI?
Not a combination of multiple LLMs, ...

@manifoldgod Any AI system. Many are already built out of multiple sub AIs, that still counts.

@manifoldgod there is no definition of LLM or AI that would not qualify ensemble models as LLM or AI, so this question is meaningless outside of extreme pedantics

Leading LLMs get <5% scores on USAMO (which selects participants for the IMO): https://arxiv.org/abs/2503.21934

@pietrokc yeah I saw this, very strange - hard to see how this dovetails with the really high performance we see elsewhere - I mean it seems to just speak to train / test contamination

but was frontier math also contaminated?

current llms trained with rl for reasoning largely do it on short solution based problems, not proof based problems. so they learn to take shortcuts. for proof based problems they are currently pretty bad. that is the essence of the difference ; frontiermath is not proof-based. USAMO is proof based. LLMs currently do good on one, bad on the other. For proofs, the best current systems seem not to be LLMs, but systems like alphaproof by google.

bought Ṁ500 YES

@Bayesian yeah, the way AlphaProof/AlphaGeometry avoid making reasoning mistakes is simply by requiring formal proofs, instead of LLMs which generate informal proofs.

@Bayesian It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based. All real math is proof-based. They just ask that the proof be of a fact of the form "certain definition picks out a certain number", to make it easier to check automatically.

@CampbellHutcheson There's been a lot of controversy about FrontierMath which I'll not rehash here. In my personal experience, all models fall FAR short of claims that they can do "research-level math that would take a professional mathematician hours or days". They routinely fail relatively trivial things whenever I test them. I have also tried to earnestly use them to learn actual math, like, existing fields that I'm just not that familiar with. I have found them to be worse than useless at that, because they'll confidently state falsehoods that take effort to disprove.

bought Ṁ150 YES

@pietrokc Gemini 2.5 does a lot better. About 25%

@Usaar33 USAMO 2025 was on 19-20 March. Then someone evaluates Gemini 2.5 on 2 April and it does massively better than models released before that date. What conclusion do you want to draw from this?

@pietrokc I am not following. How is FrontierMath proof based? They don’t look at the reasoning, only at whether the answer was correct. The ai can find the right answer by coincidence or by wrong reasoning cancelling out and it’s still graded as correct. Unlike with proofs

@pietrokc

It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based.

Bayesian is correct—obviously most math is “proof-based” in some trivial sense that isn’t really relevant here. What matters here is whether you are scored based on the correctness of the complete proof you produce. Many recent LLMs have struggled on such tests (even when they have performed well on other math that just require a narrow correct answer—you are welcome to call that number a “proof”, fair enough, but it’s not the relevant distinction here)

@CampbellHutcheson Frontier math is just very different type of problems than competition problems. At least from the public problems these are more about chaining together facts which few people know to reduce the problem to some computation. Because the computation involves complicated objects it often has to be done by hand which is the thing that takes hours for experts.

Competition problems are supposed to be the opposite, they should only require the knowledge of high-schoolers, the difficulty in them is thus orthogonal to the difficulty of Frontiermath problems.

Human contestants only get one chance to answer each question. I hope that means AI will not be judged on pass@k, where it gets k>1 chances to give a correct proof and gets points if at least one is correct. Each AI should also be judged on one submission for each question, right?

@pietrokc No. If we are talking about formal proofs, LLMs can do pass@10^9 because what matters is solving problems not how efficient you are.

If we are talking about informal proofs, it will be pass@1 because nobody will read and grade 10 samples of solutions.

© Manifold Markets, Inc.Terms + Mana-only TermsPrivacyRules