Will an AI win a gold medal on International Math Olympiad (IMO) 2025?
56% chance

Will an AI score well enough on the 2025 International Mathematical Olympiad (IMO) to earn a gold medal score (roughly a top-50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive the problems, and output its proofs, in either informal or formal form. More details below. Otherwise NO.

This is related to https://imo-grand-challenge.github.io/ but with some different rules.

Rules:

  • The result must be achieved on the IMO 2025 problemset and be reported by reliable publications no later than 1 month after the end of the IMO contest dates listed at https://www.imo-official.org/organizers.aspx (so by the end of August 20, 2025, local time at the contest site, if the IMO does not reschedule).

  • The AI has only as much time as a human competitor (4.5 hours for each of the two sets of 3 problems), but there are no other limits on the computational resources it may use during that time.

  • The AI may receive and output either informal (natural language) or formal (e.g. the Lean language) problems as input and proofs as output. (A toy Lean sketch illustrating the formal format follows this list.)

  • The AI cannot query the Internet.

  • The AI must not have access to the problems before being evaluated on them, e.g. the problems cannot be included in the training set.

    • (The deadline of 1 month after the competition is intended to give enough time for results to be finalized and published, while minimizing the chances of any accidental inclusion of the IMO solutions in the training set.)

  • If a gold medal score is achieved on IMO 2024 or an earlier IMO, that would not count for this market.
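
For readers unfamiliar with formal proofs: below is a minimal, hypothetical sketch of what a formally stated theorem and machine-checkable proof look like in Lean 4. It is only meant to illustrate the input/output format the rules allow; an actual IMO formalization would be far longer and harder.

    -- Toy example (not an IMO problem): a formally stated theorem and a
    -- proof that Lean's kernel checks automatically.
    theorem toy_add_comm (a b : Nat) : a + b = b + a := by
      exact Nat.add_comm a b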

opened a Ṁ2,000 NO limit order at 59%

2000 limit no at 59%

“no later than 1 month after the end of the IMO contest (so by end of August 20 2025, if the IMO does not reschedule its date).”

  1. What time zone?

  2. Even though the timeline on the website has like 10 days, the actual contest is some time in the middle of those dates, so there’s technically a month and a few days where the problems are available.

Good questions.

  1. Local timezone at contest site.

  2. I'm going to use the end of the IMO dates as written at https://www.imo-official.org/organizers.aspx, even though the actual contest falls in the middle of that window, because that's what I wrote and the exact number of days doesn't really matter.

bought Ṁ350 YES

So by these criteria, it's fine if the AI isn't finalized before the IMO, as long as it doesn't train on the IMO problems? This seems to open the possibility of small tweaks being made to the program that bias it toward some tasks over others, with the nature of those tweaks depending on the content of the problems.

Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

This is also unlikely to be something the public or "reliable publications" could verify (hence the open source requirement for the IMO Grand Challenge), so it seems we'd just be taking the AI developer's word for it.

Note that in a lot of IMO criteria, like Eliezer's, the AI can be produced long after the contest and you mostly just have to trust the AI developers on whether they cheated.

While you can run multiple versions, you could already do that anyway; the only difference is that you might have humans deciding which tweaks to try based on the problems (juicing the evals), or sort of cheating the time limits by not counting the time used for earlier versions you tried. So at least the cheats are much more limited.

Most models are closed and it is quite likely that the model will never be published, unless they are specifically going for the IMO grand challenge. So it's very hard to set requirements around the AI being finalized before the competition, unless you have an open model requirement.

Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

And if that worked, then your winning AI system is just the collection of those versions acting as subagents. (Assuming, as mentioned above, that you don't have humans deciding the tweaks based on the questions, and that it isn't cheating the time controls.)

Overall, I think my criteria balance false positive (cheating) and false negative potential about as well as possible. I haven't seen or thought of any verification requirements that would have prevented the hypothetical cheating scenarios above while still allowing IMO silver to resolve YES on the DeepMind announcement (if it had met the time controls), and I definitely want my question to resolve YES on that.

@jack

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

We are probably referring to different levels of model capabilities. I see a lot of probability mass on models that are correct, say, 5-50% of the time.

I'd agree that trying to resolve YES on the recent GDM announcement makes it hard to use strict criteria.

Yeah, I was referring to capability at the level of AlphaProof right now.

AlphaProof is already trying tons of different proof strategies and checking to see what works!

Similar to this, but with better, clearer resolution criteria and an earlier deadline

See also recent news https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

And another market on whether an AI will hit 1st place on the IMO: