
Competitive programming: Ranked 21st of 100k people.
Won part of day three.
Now long this market.
There are probably enough math textbooks in the world to do this. Someone will spend the eight-figure budget on it, and soon.

(Note that the Turing test remains trivial, though unpopular, to solve, for obvious reasons. ChatGPT did almost nothing of note other than throw money at a five-year-old paper and scale the cluster, and it has caught up to above-median IQ, with some savant abilities.)

@Gigacasting I think your previous arguments about why IMO gold is far off still hold water. "Find the character that occurs in both the first half and second half of a string" is a far cry from an IMO problem; if anything, these AoC problems look easier than the AMC 12.
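For scale, the quoted AoC sub-task is a few lines of code. A minimal sketch (Python; the function name and the sample string are mine, invented for illustration, not from the thread):

```python
def common_item(line: str) -> str:
    """Return the character that appears in both halves of the string.

    Assumes exactly one such character exists, as in the AoC puzzle quoted above.
    """
    mid = len(line) // 2
    first, second = set(line[:mid]), set(line[mid:])
    return (first & second).pop()

# Illustrative input: 'X' is the only character shared by both halves.
assert common_item("abcXdefX") == "X"
```

Set intersection is the whole solution, which is roughly the point being made about the gap to proof-based competition math.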

While there was nothing "NEW" in the Cicero paper, it was the heaviest combination of techniques I've ever seen outside competition, and a sign that research labs are perhaps finally diving into single tasks that line up with obscure interests (e.g. math olympiads), vs. raw scaling.
(Though somehow Meta/Deepmind ship 10x more than OpenAI.)
Updating more in line with Metaculus: 2024 is possible, 2027 not unreasonable. (Note this is a lot easier than generating verbal IMO proofs, so it won't exactly be elegant until those are built.)
Damn, just noticed how autistic I am. Spent ~2 minutes trying to confirm that the AI doesn't actually have to participate in the games themselves.


Meta AI has built a neural theorem prover that has solved 10 International Math Olympiad (IMO) problems — 5x more than any previous AI system.
That's still only about 3% of all IMO problems.

“Never underestimate the stupidity of the average or the brilliance of the extraordinary.”
- someone famous probably
Continue to expect basic Turing tests will be passed long before a math competition that maybe a double-digit number of people per year (~0.002%) can even solve any of.
—
(That said, this is a vastly more clever paper than anything else in the area.)
It includes 2025, right? Then it should say by the end of 2025.
35% is ridiculously high. Metaculus is on ~25%, and even this feels highish.
@ValeryCherepanov Update: I was thinking of "by end of 2024", not "by end of 2025", and it has gone a bit higher, so Metaculus is at 37% now. I still think it's too high.




https://bounded-regret.ghost.io/ai-forecasting-one-year-in/
This is from June - a great article on Hypermind forecasts for AI progress, and how progress on the MATH dataset one year in was far faster than predicted.

Updating even lower
“We encode a problem’s difficulty level from ‘1’ to ‘5’... while AIME problems are level 5.”
“A participant who got a perfect score on the AMC 10 exam and attended USAMO several times got 18/20”
—
The forecasters frankly didn’t know what they were doing. The bottom half of AMC 8/10/12 problems are solvable by normal people (50% on this dataset is trivial).
The gap between the ~90% scorer (roughly my profile, easy to reach for someone in competitive math, and something I could see AI doing by ~2025) and an IMO Gold is so vast that I see this as <1% on any actual IMO scoring rubric, where the AI must actually prove the statement.
Somehow people think the “Turing test” is harder than the upper-upper-upper-upper end of competitive math 🤔

Prediction: video summarization, video generation, Turing tests of any reasonable criteria long before the IMO proofs

@Gigacasting The gap is vast for humans. For AI, I think the gap is negligible. Getting to the point where AI can solve basic math problems already gets you almost all of the way to the point where you can solve IMO problems, I believe.

The gap between GPT-3, aka average verbal ability with loads of knowledge, and writing a philosophy Ph.D. thesis is vast. (Still think the Turing test could be passed in a month of hacking, but it’s a long way to long-form, high-memory coherent content.)
While there could be a gimmick, the median score on the USAMO, an invitational of the top ~250 in the country, is exactly zero. And multiple choice problem solving has almost nothing in common with proofs.
The gap from here to MATH score of 90% is much smaller than the next gap to the proofs.

Good luck to the AI with these.
(the MATH dataset was basically “predict one token”, and mostly multiple choice)



Who is bidding this up 🤔
AI can’t even get to the 90th percentile on the SAT; under any reasonable resolution it has to write human-readable proofs (not just brute-force “true”/“false” about an equation).
Somewhat surprised there are IMO participants claiming it would be “easy” to solve many problems; maybe a curse of their own knowledge or naïveté about proof vs. raw calculation (or a sign that maybe the AIME was the real IMO all along, because a machine can’t just brute-force an inequality).
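To make the proof-versus-calculation distinction concrete, here is the standard two-term AM-GM argument (a textbook fact, not taken from this thread or any competition), showing the kind of short deductive chain a grader expects rather than a bare numeric or true/false answer:

```latex
% Two-term AM-GM: the deliverable is the chain of implications, not the verdict.
For $a, b \ge 0$:
\[
  (\sqrt{a} - \sqrt{b})^2 \ge 0
  \;\Longrightarrow\;
  a + b \ge 2\sqrt{ab}
  \;\Longrightarrow\;
  \frac{a + b}{2} \ge \sqrt{ab},
  \qquad \text{with equality iff } a = b.
\]
```

Verifying the inequality numerically at a few points would score zero under an IMO-style rubric; the written argument is what gets graded.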
@Gigacasting I put money into "YES" under four assumptions:
(1) Progress in AI almost always surprises everybody. It is still surprising to us how much can be achieved by training a rudimentary model on a large dataset with a TPU Pod for a few days.
(2) There is a lot of overlap between people interested in proving theorems and people interested in AI, especially now that everyone and their grandmother can rent GPUs and TPUs for $6/hr.
(3) Solving natural language math problems with AI was a trendy topic in academic deep learning last year, and I expect at least a few groups to stick with it until 2024.
(4) Automated theorem proving already exists, albeit with exponential computational complexity. So, like Go, which is (almost) of exponential complexity, I expect that finding a proof is fundamentally a tree search problem, and thus amenable to known training methods (see the toy sketch after this comment). Thus I expect that within a year or three of someone putting together a training dataset, this problem will fall to AI.
All that said, reasons to bet "NO":
(1) The aforementioned training set might not exist yet, or training examples might not have enough variety. When training on images we do image augmentation, but how do you augment a proof?
(2) Progress on ImageNet took about 3 years after the dataset existed; 2025 is now two years away.
(3) There is a larger population of IMO competitors than competitive Go players, and thus human Gold winners at the IMO are likely more extreme outliers than human competitors in Go. This means more training will be required for AI.
(4) Encoding the Go game state for training was very much an image problem, whereas IMO problems don't fit easily into common natural-language encoders. It is likely that attention is NOT all you need for this problem, as human contestants will often read problems many times.
(5) Actually, automated theorem proving is NOT of roughly exponential complexity like Go. Go is of subexponential complexity, while ATP can sometimes be uncomputable.
(6) Most importantly, IMO solutions are graded by humans. If the IMO board does not accept AI proofs being mixed in with the solutions as a test, then this is guaranteed to resolve "NO."
Note also that unrelated to the difficulty of the problem, the "YES" result has a shorter time to expected payout, and thus it is going to be overvalued compared to "NO".
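A toy sketch of the "proof search as tree search" framing in assumption (4) above. Everything here is invented for illustration (the rewrite rules, the heuristic, and the function names are not from any real prover); real systems search over formal proof states, but the control loop has the same shape:

```python
import heapq
from itertools import count

# Hypothetical rewrite rules standing in for inference steps.
RULES = [("a", "ab"), ("b", "ba"), ("ab", "c")]

def neighbours(s: str):
    """All strings reachable by applying one rule at one position."""
    for lhs, rhs in RULES:
        i = s.find(lhs)
        while i != -1:
            yield s[:i] + rhs + s[i + len(lhs):]
            i = s.find(lhs, i + 1)

def prove(start: str, goal: str, max_nodes: int = 10_000):
    """Best-first search guided by a crude heuristic (length gap to goal)."""
    tie = count()  # tie-breaker so the heap never compares paths
    frontier = [(abs(len(start) - len(goal)), next(tie), start, [start])]
    seen = {start}
    while frontier and max_nodes:
        _, _, s, path = heapq.heappop(frontier)
        if s == goal:
            return path  # the "proof": the trace of rewrite steps
        max_nodes -= 1
        for t in neighbours(s):
            if t not in seen:
                seen.add(t)
                heapq.heappush(
                    frontier,
                    (abs(len(t) - len(goal)), next(tie), t, path + [t]),
                )
    return None  # search budget exhausted without reaching the goal

print(prove("a", "c"))  # ['a', 'ab', 'c']
```

Swapping the hand-written length heuristic for a learned policy/value model is, at this level of caricature, the AlphaZero-style move the comment is gesturing at; the hard part is everything the sketch omits (the formal environment, the data, and the branching factor).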
@brp Minor progress against assumption "NO(1)": https://aclanthology.org/2022.naacl-main.310.pdf "Practice Makes a Solver Perfect: Data Augmentation for Math Word Problem Solvers." Mind that this is data augmentation for word problems of the standardized test variety, not for those in proof-based math competitions.

Resolves to YES if either Eliezer or Paul acknowledges that an AI has succeeded at this task.
When does it resolve as NO?


The bar to imitate a median person is vastly below the level necessary to be in the top 0.001% at mathematics (these models aren’t yet near the top decile at math).
There are massive incentives to do cool stuff like proofs and not to reveal imitative skills, but I tend to think imitating the average human happens way before this, and the “top STEM grad student” Turing test might even be easier than this.

An AI scoring 10+ on the AIME seems plausible in this timeframe (massive speed advantage, numerical answers, very limited set of knowledge required); I would become optimistic that this could happen within another year or two once there are public models acing the SAT/AMCs.
@JoyVoid Also, I feel like predictors on AI haven't updated on their updates yet. If you're consistently under the target, even when you correct every time, that means the target is moving/accelerating faster than you thought.













