In Feb 2022, Paul Christiano wrote: Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025.... My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:
I'd put 4% on "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem" where "hardest problem" = "usually problem #6, but use problem #3 instead if either: (i) problem 6 is geo or (ii) problem 3 is combinatorics and problem 6 is algebra." (Would prefer just pick the hardest problem after seeing the test but seems better to commit to a procedure.)
Maybe I'll go 8% on "gets gold" instead of "solves hardest problem."
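Paul's procedure for picking the "hardest problem" can be written as a small function. This is only a sketch of the rule as quoted above: the function name and the topic labels ("geometry", "combinatorics", "algebra") are assumptions made for illustration, not anything Paul or the IMO specifies.

```python
def hardest_problem(p3_topic: str, p6_topic: str) -> int:
    """Return the problem number (3 or 6) treated as 'hardest'.

    Topic labels are hypothetical; the quoted rule only names
    geometry, combinatorics, and algebra.
    """
    p3 = p3_topic.strip().lower()
    p6 = p6_topic.strip().lower()

    # (i) problem 6 is geometry: use problem 3 instead
    if p6 == "geometry":
        return 3
    # (ii) problem 3 is combinatorics and problem 6 is algebra: use problem 3
    if p3 == "combinatorics" and p6 == "algebra":
        return 3
    # default: problem 6
    return 6
```

For example, hardest_problem("combinatorics", "algebra") picks problem 3, while any pairing that matches neither exception falls through to problem 6.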
Eliezer spent less time revising his prediction, but said (earlier in the discussion):
My probability is at least 16% [on the IMO grand challenge falling], though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more. Paul?
EDIT: I see they want to demand that the AI be open-sourced publicly before the first day of the IMO, which unfortunately sounds like the sort of foolish little real-world obstacle which can prevent a proposition like this from being judged true even where the technical capability exists. I'll stand by a >16% probability of the technical capability existing by end of 2025
So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.
Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task.
Related market: https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b
Update: As noted by Paul, the qualifying years for IMO completion are 2023, 2024, and 2025.
Update 2024-06-21: Description formatting
Update 2024-07-25: Changed title from "by 2025" to "by the end of 2025" for clarity
@AdamK OK, so who's benchmarking o3-mini against the 2024 IMO? We could have results within the week.
@AdamK Just checking for agreement on resolution criteria: I believe that this can only resolve on the actual 2025 IMO. So if you give o3-mini the 2024 IMO and it gets every question right, that will not count toward a YES resolution. Is that your understanding as well?
@EricNeyman Disagree. The original dialogue mentions any of "the 2022, 2023, 2024, or 2025 IMO". The fact that Eliezer + Paul were allowing the system to be run until the end of 2025 suggests that they are fine with retrodicting a model against the most recent IMO as long as the model had never seen those questions. I thus think benchmarking o3 against the 2024 IMO should count (assuming it gets Gold-medal performance within time constraints) if OAI can confirm directly or indirectly that o3 hadn't seen those problems.
@AdamK Quoting the description:
"I'd put 4% on 'For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to solve the single hardest problem'...
So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls etc. of grand challenge) in one of 2022-2025.
Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task."
That's why I think o3 getting gold on 2024 IMO wouldn't count. Do you still disagree?
@EricNeyman My understanding is that this market depends on whether Paul/Eliezer agree that the feat of IMO Gold was accomplished. If OAI clarifies that o3-mini has never seen the 2024 IMO, and the system gets Gold, I would call on Eliezer and/or Paul to say their bet has been settled, at least in spirit. Again, assuming o3 actually scores a Gold within time constraints, I don't think there's any point in them waiting another 5 months to make a statement.
However, if they want to wait to resolve until after the 2025 IMO, that's fine by me as well.
@AdamK Huh, that's confusing. Suppose for instance that o3 scores a gold on the 2024 IMO, but no model built before the 2025 IMO scores a gold on the 2025 IMO. It seems pretty unambiguous to me that the market ought to resolve NO, given that the "built before the IMO" clause wasn't satisfied. I don't really see a reading of the question under which the market ought to resolve YES.
Or is your claim more like: "I'm very confident that if o3 scores a gold on the 2024 IMO, then the market will resolve YES after the 2025 IMO, so we might as well resolve it now"? (If so, I don't think that such evidence would meet my confidence bar for resolving the market now.)
(Also oops, didn't mean to thumbs-down your latest comment, pressed that button by accident.)
@AdamK The market relies on what Paul and Eliezer say. So they could always decide to resolve early for some reason.
However, I would guess based on their wording they won't resolve for an AI trained after the IMO in question. Any AI trained after the IMO could have both the questions and answers from the IMO in its training data, and it's easy for a current AI to reproduce an answer directly from its training data. The point of the question is to see whether the AI can solve novel problems that it hasn't been trained on, and the only way you can ensure that is by using an AI trained before the problems were released publicly.
I think there's some chance (though it's far from certain) that Eliezer says this should resolve YES after the 2025 IMO, because he phrased things as the "technical capability existing by end of 2025", but I think it's unlikely he jumps the gun before the 2025 IMO.
@EricNeyman It's definitely up to Paul / Eliezer, and I won't speak on either of their behalves.
My opinion is that it is kosher to retrodict a new model against the most recent competition, as long as it hasn't seen the questions, and conclude "An AI got Gold in this Olympiad," and "the technical capability exists" for an AI to get a Gold Medal on the IMO. This is, for instance, how the resolution criteria are set up for the Metaculus 2025 IOI question. I agree that this market shouldn't resolve yet if we decide to interpret the "before the IMO" clauses strictly. With loans, I lose little if this market doesn't resolve soon, and NO traders might even burn more mana on cope.
@MaxMorehead None at all, my trading is random, so people should be eager to compete to give me good prices
@BaryLevy A hypothetical AI which was trained only on data from before IMO 2025, was built before the IMO, and got gold on IMO 2025 but wasn't released/announced until after the 2025 IMO would cause this question to resolve YES, assuming Paul/Eliezer agree there's no data contamination. If it's ambiguous when exactly the AI was built, but it's clear it was only trained on pre-IMO 2025 data, Paul may decide to concede the bet anyway.
Insofar as people think an IMO-gold AI is a close precursor to powerful AGI, you might expect the company that made it to keep it secret for longer.
https://arxiv.org/abs/2410.05229
Apple researchers have developed variants of the GSM-8K benchmark to assess the mathematical reasoning of LLMs. They concluded that LLMs cannot reason mathematically and that what looks like reasoning is sophisticated pattern matching.
@CozmicK I expect that the AI that accomplishes this won't be just an LLM, though that could be one component.
AlphaProof is very close to accomplishing this goal. It's gold-medal level on geometry questions, and silver-medal level overall: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
If I'm understanding it correctly, the resolution criteria for this market are the IMO Grand Challenge's minus the open-source criterion. This means that the AI must receive a formal representation of the problem and must output a formal solution in Lean. Is that so? @Austin
@Mothmatic Paul said he would concede the bet independent of whether the input or output is natural language or a formal language (below in the comments).
@DanielPCamara You cannot get gold on an old math olympiad. This is for the International Math Olympiad, not regional versions.