In Feb 2022, Paul Christiano wrote: Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025.... My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:
I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)
Maybe I'll go 8% on "gets gold" instead of "solves hardest problem."
Eliezer spent less time revising his prediction, but said (earlier in the discussion):
My probability is at least 16% [on the IMO grand challenge falling], though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I'll stand by a >16% probability of the technical capability existing by end of 2025
So I think we have Paul at <8% and Eliezer at >16% that an AI built before the IMO is able to get a gold (under the time controls etc. of the grand challenge) in one of 2022-2025.
Resolves to YES if either Eliezer or Paul acknowledges that an AI has succeeded at this task.
Related market: https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b
Update: As noted by Paul, the qualifying years for the IMO competition are 2023, 2024, and 2025.
Update 2024-06-21: Description formatting
Update 2024-07-25: Changed title from "by 2025" to "by the end of 2025" for clarity
@EliezerYudkowsky how likely is it that you will rule on this question, do you think, particularly in edge cases, e.g. where a model gets gold but it's unclear how much computation time was spent on each question.
@NathanpmYoung It was a fairly noticeable debate between myself and Christiano so I expect we'll figure out how that settled from our own viewpoints, and hopefully write that up in enough detail for this market to use.
@Austin Given the spread that's developed between this market and other IMO 2025 markets on Manifold, I think it would be good if you'd answer Nathan's questions below:
https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat#csfur0rer68
I'd also want clarity on how you plan to resolve if Eliezer or Paul says something like the following in, say, December: "I think that IMO gold-level problem-solving capability exists within the top labs, but the systems they've demoed may have been trained after IMO 2025 and/or may have contaminated data."
How this market will resolve is made very confusing by the fact that what Eliezer said in his original comment and the language in Paul's later post are different.
@MalachiteEagle these are open-source efforts. They show great progress for math-oriented LLMs but still fall short of state-of-the-art proprietary solutions. AIMO2 problem difficulty is a level below the IMO.
@MalachiteEagle It cannot do combinatorics, and last year it arguably got lucky with the functional equation at #6. Can it do all kinds of algebra problems? Also, it would have only 4.5 hours; last time it took days.
@nathanwei you don't think o4 is going to be good at combinatorics? I haven't seen much info on o3's capabilities in this domain
@MalachiteEagle I mean o3 completely sucks at olympiads - see https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat#e41lre4y806 - so I was thinking more AlphaProof than o4.
@nathanwei that's o3-mini though right? If OpenAI does an attempt they're not likely to use such restricted inference compute
@MalachiteEagle even with a lot of inference compute o4 would probably still suck at a bunch of types of problems.
@Bayesian I'm willing to believe that's true, but I think it's plausible that o4 is going to be good enough to get gold on many olympiad problems
@MalachiteEagle "get gold on many olympiad problems" do you just mean it'll be good enough to solve some problems? if so I agree
@Bayesian yeah I meant accurately solving problems in a way that doesn't look too weird, like not 500 pages of equations or something