Test set: Performance on a pre-existing IMO is acceptable only if the developers claim it was not in the training set, or compelling third-party evidence of this is found. Queries must be posed as they are to IMO participants, i.e. with natural language and image input.
Model details: Publicly available means queryable by the public and/or via API access. The model need not be open-weights. No internet search allowed. Arbitrary scaffolding, search, program use, etc. allowed. Multi-modal systems count as LLMs. A modular system that is part LM, part prover counts as an LM if the prover uses parameters that are also back-propagated through during natural-language pre-training. If there is significant uncertainty about whether this holds, e.g. mixed reporting on how modular a closed-source model is, then I will wait up to a year to resolve. Wall-clock runtime (or effective serial runtime if parallel calls are used) must be less than the IMO time limit, i.e. 9 hours for 6 questions.
I will take into account feedback on resolution criteria until September 2024, after which I will try to keep changes to resolution criteria minimal.
Thanks, clarified. For now I put "Wall-clock runtime (or effective serial runtime if parallel calls are used) must be less than the IMO time limit, i.e. 9 hours for 6 questions." I'm not sure of the easiest way to operationalize this, given that in practice I assume search would be done in parallel up to some rate limit. It seems to me that either this approach or an arbitrary total compute budget should be chosen, e.g. $1000 worth of tokens. Open to opinions here.
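One way to operationalize "effective serial runtime" for a parallel run is the critical path: parallel calls overlap, but each chain of dependent calls is serial, so the run's effective duration is the longest dependency chain. A minimal sketch, assuming the system's calls can be logged as a dependency DAG with per-call durations (the call names and structure below are hypothetical):

```python
# Effective serial runtime as the critical path through a DAG of model calls.
# durations: {call: seconds}; deps: {call: [calls it must wait on]}.
def effective_serial_runtime(durations, deps):
    memo = {}

    def finish_time(call):
        # Finish time = own duration + latest finish among dependencies.
        if call not in memo:
            memo[call] = durations[call] + max(
                (finish_time(d) for d in deps.get(call, [])), default=0.0
            )
        return memo[call]

    return max(map(finish_time, durations))


# Hypothetical run: a 30 s planning call fans out to 100 parallel 60 s
# search branches, whose results are merged by a 30 s verification call.
durations = {"plan": 30.0, "verify": 30.0}
deps = {"verify": []}
for i in range(100):
    durations[f"branch{i}"] = 60.0
    deps[f"branch{i}"] = ["plan"]
    deps["verify"].append(f"branch{i}")

print(effective_serial_runtime(durations, deps))  # 120.0, not 30 + 100*60 + 30
```

Under this measure the 100-way fan-out costs only one branch's worth of time, which matches the intent: massive parallelism is allowed, but the serial chain must fit within the 9-hour budget.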
The LLM should be given 4.5 hours to work on the day 1 problems and then 4.5 hours to work on the day 2 problems, as in the real IMO. Giving it all problems at once would be an unfair advantage compared to humans, because if, for example, a day 1 problem turns out to be incredibly hard, it would be preferable to spend that day 1 time on the day 2 problems.
My motivation for the time constraint is primarily to avoid false positives from publicity stunts where a company throws an absurd amount of test-time compute at the IMO.
I want the requirements to be minimally detailed so that we don't face issues where no one ran the precise test stipulated. While I agree that the joint 9-hour limit gives the AI some advantage, I'll keep the resolution criteria as they are to avoid being overly specific.