
Test set: Performance on a pre-existing IMO is acceptable only if the developers claim it was not in the training set, or compelling third-party evidence of this is found. Queries must be posed as they are to IMO participants, i.e. with natural-language and image input.
Model details: "Publicly available" means queryable by the public and/or accessible via API. The model need not be open-weights. No internet search is allowed; arbitrary scaffolding, search over model outputs, program use, etc. are allowed. Multi-modal systems count as LLMs. If a system is modular, part LM and part prover, it counts as an LM if the prover uses parameters that are also back-propagated through during natural-language pre-training. If there is significant uncertainty about whether this holds, e.g. mixed reporting on how modular a closed-source model is, then I will wait to resolve for up to a year. Wall-clock runtime (or effective serial runtime, if parallel calls are used) must be less than IMO time, i.e. 9 hours for 6 questions.
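One way to read "effective serial runtime" when parallel calls are used is as critical-path time: serial stages add their durations, while parallel branches contribute only the longest branch. This is a minimal sketch of that reading; the function name and the serial/parallel representation are my own illustration, not part of the criteria.

```python
def effective_runtime(stage) -> float:
    """Critical-path time of a run described as nested stages.

    A stage is either a duration in seconds (number), or a tuple
    ("serial", [stages...]) or ("parallel", [stages...]).
    Serial stages sum; parallel stages count only the slowest branch.
    """
    if isinstance(stage, (int, float)):
        return float(stage)
    kind, parts = stage
    times = [effective_runtime(p) for p in parts]
    return sum(times) if kind == "serial" else max(times)


# Hypothetical run: 5 s of prompting, then 8 parallel samples whose
# slowest takes 120 s, then a 30 s verification pass.
run = ("serial", [5.0, ("parallel", [90.0, 120.0, 60.0]), 30.0])
print(effective_runtime(run))  # 155.0
```

Under this reading, launching many parallel samples does not inflate the clock, but any serial chain of calls must fit within the 9-hour budget.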
I will take feedback on the resolution criteria into account until September 2024, after which I will try to keep changes to them minimal.