Will an AI get gold on any International Math Olympiad by 2025?
23% chance
An opportunity to join in on https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer. Eliezer has this at >16%; Paul at <8%. Resolves to YES if either Eliezer or Paul acknowledges that an AI has succeeded at this task. Related market: https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b
Gigacasting
is predicting YES at 32%

Competitive programming: Ranked 21st of 100k people.

Won part of day three.

Now long this market.

There are probably enough math textbooks in the world to do this. Someone will spend the eight-figure budget to do it, and soon.

Gigacasting
is predicting YES at 32%

(Note that the Turing test remains trivial but unpopular to solve, for obvious reasons. ChatGPT did almost nothing of note other than throw money at a five-year-old paper and scale the cluster, yet it has caught up to above-median IQ, with some savant abilities.)

Bolton Bailey
bought Ṁ100 of NO

@Gigacasting I think your previous arguments about why IMO gold is far off still hold water. "Find the character that occurs in both the first half and second half of a string" is a far cry from an IMO problem; if anything, these AoC problems look easier than the AMC 12.

Gigacasting
is predicting YES at 25%

While there was nothing “NEW” in the Cicero paper, it was the heaviest combination of techniques I’ve ever seen outside competition, and a sign that research labs are perhaps finally diving into single tasks that line up with obscure interests (e.g. math olympiads), vs. raw scaling.

(Though somehow Meta/Deepmind ship 10x more than OpenAI.)

Updating more in line with Metaculus: 2024 is possible, 2027 not unreasonable. (Note this is a lot easier than generating verbal IMO proofs, so it won’t exactly be elegant until those are built.)

GG

Thought this had already happened tbh.

EmrikGarden

Damn, just noticed how autistic I am. Spent ~2 minutes trying to confirm that the AI doesn't actually have to participate in the games themselves.

Jack
is predicting YES at 36%
Orpheus
is predicting NO at 37%
Yev
is predicting NO at 37%

Meta AI has built a neural theorem prover that has solved 10 International Math Olympiad (IMO) problems — 5x more than any previous AI system.

That's still only about 3% of all IMO problems.

Gigacasting

“Never underestimate the stupidity of the average or the brilliance of the extraordinary.”

- someone famous probably

Continue to expect basic Turing tests will be passed long before a math competition that maybe a double-digit number of people per year (~0.002%) can even solve a single problem of.

(That said, this is a vastly more clever paper than anything else in the area.)

Valery Cherepanov
is predicting NO at 34%

It includes 2025, right? Then it should say by the end of 2025.

Valery Cherepanov
bought Ṁ10 of NO

35% is ridiculously high. Metaculus is at ~25%, and even that feels highish.

Valery Cherepanov
is predicting NO at 34%

@ValeryCherepanov Update: I was thinking of "by end of 2024," not "by end of 2025," and the latter runs a bit higher; Metaculus is at 37% for it. I still think that's too high.

Gigacasting
is predicting NO at 32%

And

Gigacasting
bought Ṁ50 of NO

Enjoy:

Jack
is predicting NO at 44%

https://bounded-regret.ghost.io/ai-forecasting-one-year-in/

This is from June: a great article on the Hypermind forecasts for AI progress, and how progress on the MATH dataset one year in was far faster than predicted.

Gigacasting
bought Ṁ50 of NO

Updating even lower

“We encode a problem’s difficulty level from ‘1’ to ‘5’ ... while AIME problems are level 5.”

“A participant who got a perfect score on the AMC 10 exam and attended USAMO several times got 18/20.”

The forecasters frankly didn’t know what they were doing. The bottom half of AMC 8/10/12 problems are solvable by normal people (50% on this dataset is trivial).

The gap between the ~90% scorer (roughly my profile; easy for someone in competitive math; I could see AI doing it by ~2025) and an IMO gold is so vast that I see this as <1% on any actual IMO scoring rubric, where the AI must actually prove the statement.

Somehow people think the “Turing test” is harder than upper-upper-upper-upper-end competitive math 🤔

Gigacasting
bought Ṁ250 of NO

Prediction: video summarization, video generation, and Turing tests under any reasonable criteria all come long before IMO proofs.

Jack
bought Ṁ20 of YES

@Gigacasting The gap is vast for humans. For AI, I think the gap is negligible. Getting to the point where AI can solve basic math problems already gets you almost all of the way to solving IMO problems, I believe.

Gigacasting
is predicting NO at 34%

The gap between GPT-3, i.e. average verbal ability with loads of knowledge, and writing a philosophy Ph.D. thesis is vast. (Still think the Turing test could be passed in a month of hacking, but it’s a long way to long-form, high-memory, coherent content.)

While there could be a gimmick, the median score on the USAMO, an invitational for the top ~250 in the country, is exactly zero. And multiple-choice problem solving has almost nothing in common with proofs.

The gap from here to a MATH score of 90% is much smaller than the next gap, to the proofs.

Gigacasting
bought Ṁ12 of NO

Good luck to the AI with these.

(the MATH dataset was basically “predict one token”, and mostly multiple choice)

Gigacasting
bought Ṁ180 of NO

Who is bidding this up 🤔

AI can’t even get 90th percentile on the SAT; under any reasonable resolution it has to write human-readable proofs (not just brute-force “true”/“false” judgments about an equation).

Somewhat surprised there are IMO participants claiming it would be “easy” to solve many problems; maybe a curse of their own knowledge, or naïveté about proof vs. raw calculation (or a sign the AIME was the real IMO all along, because a machine can’t just brute-force an inequality).

brp

@Gigacasting I put money into "YES" under four assumptions: (1) Progress in AI almost always surprises everybody. It is still surprising how much can be achieved by training a rudimentary model on a large dataset with a TPU pod for a few days. (2) There is a lot of overlap between people interested in proving theorems and people interested in AI, especially now that everyone and their grandmother can rent GPUs and TPUs for $6/hr. (3) Solving natural-language math problems with AI was a trendy topic in academic deep learning last year, and I expect at least a few groups to stick with it until 2024. (4) Automated theorem proving already exists, albeit with exponential computational complexity. So like Go, which is (almost) of exponential complexity, I expect that finding a proof is fundamentally an efficient tree-search problem, and thus amenable to known training methods. I therefore expect that within a year or three of someone putting together a training dataset, this problem will fall to AI.
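
To make assumption (4) concrete, here is a toy sketch of "proving" as best-first tree search. Everything in it is a hypothetical stand-in: the integer states play the role of proof states, the arithmetic moves play the role of tactics, and the hand-written heuristic plays the role of a learned value model; it is not any real prover's API.

```python
import heapq

# Toy stand-in for neural proof search (illustration only): states are
# integers, "tactics" are arithmetic moves, and the "value model" is a
# hand-written distance heuristic. A real system would use a proof
# assistant's tactic engine and a learned scorer instead.

TACTICS = [
    ("double", lambda n: n * 2),
    ("increment", lambda n: n + 1),
    ("square", lambda n: n * n),
]

def best_first_search(start, goal, max_nodes=100_000):
    def score(n):                      # stand-in value model:
        return -abs(goal - n)          # closer to the goal is better
    counter = 0                        # heap tie-breaker and node budget
    frontier = [(-score(start), counter, start, [])]
    visited = {start}
    while frontier and counter < max_nodes:
        _, _, state, path = heapq.heappop(frontier)
        if state == goal:              # "theorem proved"
            return path
        for name, fn in TACTICS:
            child = fn(state)
            if 0 < child <= 10 * goal and child not in visited:
                visited.add(child)
                counter += 1
                heapq.heappush(
                    frontier,
                    (-score(child), counter, child, path + [name]),
                )
    return None                        # search budget exhausted

print(best_first_search(3, 100))       # a "proof": tactic sequence from 3 to 100
```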

All that said, reasons to bet "NO": (1) The aforementioned training set might not exist yet, or training examples might not have enough variety. When training on images we do image augmentation, but how do you augment a proof? (2) Progress on ImageNet took about 3 years after the dataset existed; 2025 is now two years away. (3) There is a larger population of IMO competitors than competitive Go players, so human gold winners at the IMO are likely more extreme outliers than top human Go players. This means more training will be required for the AI. (4) Encoding Go game state for training was very much an image problem, whereas IMO problems don't fit easily into common natural-language encoders. It is likely that attention is NOT all you need for this problem, as human contestants will often read problems many times. (5) Actually, automated theorem proving is NOT of roughly exponential complexity like Go: Go is of subexponential complexity, while ATP can sometimes be uncomputable. (6) Most importantly, IMO solutions are graded by humans. If the IMO board does not accept AI proofs mixed in with the human solutions as a test, then this is guaranteed to resolve "NO."

Note also that, unrelated to the difficulty of the problem, the "YES" result has a shorter time to expected payout, and is thus going to be overvalued compared to "NO".
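
A toy model of that timing effect; every number here (the assumed true probability, the discount rate, the payout horizons) is invented for illustration, not taken from this market:

```python
# Toy model of the payout-timing bias (all numbers hypothetical).
# YES pays out as soon as an AI wins gold; NO pays only at the end of
# 2025, so mana backing NO is locked up longer and demands a discount.
q = 0.30                 # assumed true probability of YES
d = 0.90                 # per-year discount factor on locked-up mana
t_yes, t_no = 1.0, 3.0   # expected years until YES / NO payout

# The market clears where the discounted return per mana is equal:
#     q * d**t_yes / p == (1 - q) * d**t_no / (1 - p)
a = q * d ** t_yes
b = (1 - q) * d ** t_no
p = a / (a + b)
print(f"market price {p:.2f} vs. true probability {q:.2f}")  # 0.35 vs. 0.30
```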

brp

@brp Minor progress against assumption "NO(1)": https://aclanthology.org/2022.naacl-main.310.pdf "Practice Makes a Solver Perfect: Data Augmentation for Math Word Problem Solvers." Mind that this is data augmentation for word problems of the standardized test variety, not for those in proof-based math competitions.

Account deletion requested
is predicting NO at 43%

"Resolves to YES if either Eliezer or Paul acknowledge that an AI has succeeded at this task."

When does it resolve NO?

Jack
is predicting NO at 36%

It should resolve NO at the end of 2025.

Gigacasting
bought Ṁ50 of NO

The bar for imitating a median person is vastly below the level necessary to be in the top 0.001% at mathematics (these models aren’t yet near the top decile at math).

There are massive incentives to do cool stuff like proofs rather than reveal merely imitative skills, but I tend to think "average human" happens way before this, and a "top STEM grad student" Turing test might even be easier than this.

Gigacasting
is predicting NO at 41%

An AI scoring 10+ on the AIME seems plausible in this timeframe (massive speed advantage, numerical answers, very limited set of required knowledge); I would become optimistic that this could happen within another year or two once there are public models acing the SAT/AMCs.

joy_void_joy
bought Ṁ465 of YES
joy_void_joy
is predicting YES at 53%

@JoyVoid Also, I feel like predictors on AI haven't updated on their updates yet. If you're consistently under the target, even when you correct every time, that means the target is moving/accelerating faster than you thought.
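
A quick toy illustration of that point (the accelerating "target" series is invented, purely to show the pattern):

```python
# A forecaster who fully corrects to each observation, but carries no
# trend term, stays under an accelerating target forever, and the
# errors keep growing. The quadratic target is purely illustrative.
target = lambda t: t * t          # accelerating "AI progress" metric
forecast = 0.0
for t in range(1, 6):
    actual = target(t)
    error = actual - forecast     # always positive, and growing
    print(f"t={t}: forecast={forecast:.0f}, actual={actual}, error={error:.0f}")
    forecast = float(actual)      # full correction after each miss
```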

JamesGiammona

Given progress with Minerva, I think this should be higher. https://twitter.com/alewkowycz/status/1542559176483823622
Sinclair

See also https://www.metaculus.com/questions/6728/ai-wins-imo-gold-medal/; the current distribution is 25% by 2026, 50% by 2029, 75% by 2037.
MP

I don't understand how this bet is settled. I once worked at the IMO (during undergrad; I'm not that smart) and the grading itself involves significant subjectivity: 100 teachers take two days, LITERALLY locked in a hotel, to grade the exams. The only way I think a bet like this could be settled is if the IMO agreed to send a solution made by an AI to the examiners. I know that THIS market settles however Eliezer and Paul settle their bet; just sharing my thoughts.
Conflux
bought Ṁ50 of NO

This is an Austin market; I doubt it will resolve based on a fake IMO.
N. N.
bought Ṁ100 of NO

Agree w/ Lorenzo, current price is crazy.
Lorenzo
bought Ṁ129 of NO

No idea what's going on in this market. Eliezer and Paul clearly agree on what the IMO is. Curious about the reasoning of people buying this much YES (well above Eliezer's >16%).
Duncan
bought Ṁ1 of YES

The IMO is probably specific enough, and neither of the founding bettors would question this. Neither this market nor the LessWrong post actually uses the trademarked term "International Mathematical Olympiad," though.
MP

@Duncan "Any IMO" likely means the 2023, 2024, or 2025 IMO.
Duncan
bought Ṁ1 of YES

I guess I'd have to get EY's buy-in first, and that's probably a no-go.
Duncan
bought Ṁ1 of YES

So, if I were to set up a math olympiad with Canadian and American 1st graders, and an AI beat them, would that qualify?
Lorenzo
bought Ṁ13 of NO

Why is this market at 81% if we have Paul at <8% and Eliezer at >16%? I feel like I'm missing something about the resolution, but still, 81% seems high.