How many years between an AI scoring bronze on an IMO and an AI solving half of a Miklós Schweitzer exam?
Expected value: 4 years

The AI may do retrieval over the internet for the Miklós Schweitzer case, but may not do retrieval for the IMO case. A combination of two models, e.g. a formalizer and a solver, would also qualify.

Any amount of prompt engineering, rejection sampling, or access to auxiliary systems is fine. A prompt that includes, e.g., related IMO problems would, however, be disqualified.
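
For concreteness, here is a minimal sketch of one setup that would qualify under these rules: a formalizer model, a solver model, and rejection sampling against a proof checker. All function names and interfaces below are hypothetical placeholders, not a prescribed implementation.

```python
# Hypothetical sketch of a qualifying pipeline: a formalizer model translates the
# problem, a solver model proposes proofs, and rejection sampling keeps only
# candidates that a verifier accepts. All names below are illustrative placeholders.

from typing import Optional

def formalize(problem_text: str) -> str:
    """Formalizer model: natural-language problem -> formal statement (e.g. Lean)."""
    raise NotImplementedError

def propose_proof(formal_statement: str) -> str:
    """Solver model: sample one candidate proof for the formal statement."""
    raise NotImplementedError

def verify(formal_statement: str, proof: str) -> bool:
    """Proof checker (an allowed auxiliary system): accept or reject a candidate."""
    raise NotImplementedError

def solve(problem_text: str, num_samples: int = 64) -> Optional[str]:
    """Rejection sampling: draw candidate proofs until one verifies, or give up."""
    statement = formalize(problem_text)
    for _ in range(num_samples):
        candidate = propose_proof(statement)
        if verify(statement, candidate):
            return candidate
    return None
```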

dp bought Ṁ5 of HIGHER

My rationale for HIGHER is https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer/comment/aA9mewsxZHuWfei7q, although I can't really guarantee it moves the gap to 4 years. It is also unclear how much retrieval will help; maybe future models will rely heavily on querying internet embeddings.

@dp Interesting, do you think there's an easy-to-describe benchmark which dodges this failure mode?
Separately, how likely do you think it is that this sort of direct 'bashing' setup will actually be constructed? There's a rather short time window for that to happen before a more general LM-like solver is made.

dp predicts HIGHER

@JacobPfau I don't have an inside view on whether any lab wants to do this. The AlphaCode team made a large, non-generalizable technical effort to beat some benchmarks; if there is a similar drive to just improve the numbers on math performance, I'd say the probability is >10% that the solution gives the language model query access to some "bashing" solvers.
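
To illustrate the kind of setup I mean, here is a rough sketch of giving a language model query access to a "bashing" solver mid-solution. The dispatch loop, the call syntax, and every name below are my own assumptions, not a description of any existing system.

```python
# Rough sketch of giving a language model query access to a "bashing" solver
# (e.g. a computer-algebra or brute-force backend). The LM emits tool calls such
# as CALL bash_solver(...); the loop executes them and feeds results back.
# Everything here (model interface, tool name, call syntax) is hypothetical.

import re

def bash_solver(expression: str) -> str:
    """Placeholder for a symbolic/numeric 'bashing' backend, e.g. a CAS query."""
    raise NotImplementedError

def language_model(transcript: str) -> str:
    """Placeholder for the LM; returns either a tool call or a final answer."""
    raise NotImplementedError

TOOL_CALL = re.compile(r"CALL\s+bash_solver\((.*)\)")

def solve_with_tools(problem: str, max_steps: int = 10) -> str:
    """Alternate between LM steps and solver queries until a final answer appears."""
    transcript = problem
    for _ in range(max_steps):
        step = language_model(transcript)
        match = TOOL_CALL.search(step)
        if match is None:
            return step  # the model produced a final answer
        result = bash_solver(match.group(1))
        transcript += f"\n{step}\nRESULT: {result}"
    return "no answer within step budget"
```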

As for the easy-to-describe benchmark you want, and the related issues with both calculator (or Python) and no-calculator math benchmarks, maybe I'll have some researchy stuff to say about that in the future... but my ideas are too unpolished to communicate right now.
