How many years between an AI scoring bronze on an IMO and an AI solving half of a Miklós Schweitzer exam?

The AI may do retrieval over the internet in the Miklós Schweitzer case, but may not do retrieval in the IMO case. A combination of two models, e.g. a formalizer and a solver, would also qualify.

Any amount of prompt engineering, rejection sampling, or access to auxiliary systems is fine. A prompt that includes, e.g., related IMO problems would, however, be disqualified.

dp bought Ṁ5 of HIGHER

My rationale for HIGHER: although I can't really guarantee it, I expect the gap to be more like 4 years. It is also unclear how much retrieval will help -- maybe future models will rely heavily on querying Internet embeddings.

Jacob Pfau

@dp Interesting, do you think there's an easy-to-describe benchmark which dodges this failure mode?
Separately, how likely do you think it is that this sort of direct 'bashing' setup will actually be constructed? There's a rather short time window for that to happen before a more general LM-like solver is made.

dp is predicting HIGHER at 20%

@JacobPfau I don't have an inside view on whether any lab wants to do this. The AlphaCode team made a large, non-generalizable technical effort to beat some benchmarks; if there is a similar drive to just improve the numbers on math performance, I'd say the probability is >10% that the solution gives the language model query access to some "bashing" solvers.

As for the easy-to-describe benchmark you asked about, and the related issues with both calculator (or Python) and no-calculator math benchmarks: maybe I'll have some researchy things to say about it in the future, but my ideas are too unpolished to communicate right now.