
This question aims to test whether AI systems will approach the breadth and maturity of the professional mathematical community before 2028.
Currently, the On-line Encyclopedia of Integer Sequences (OEIS) receives several dozen new sequences per day, so thousands of fresh sequences accumulate each year that could serve as material for AI evals. In mid-2027, I will create or sponsor efforts (or defer to sufficiently high-quality outside evaluations or benchmarks) to determine how well frontier AI systems can suggest reasonable and mathematically sound explanations for new OEIS sequences, when given terms from the sequence and possibly hints about the domain of the intended explanation.
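To make the task concrete, here is a minimal sketch of one way such an eval could be scored: the system sees a visible prefix of a sequence and proposes a generating rule, and a judge checks that rule against held-out terms. The split into `shown` and `held_out`, and the use of A000108 (the Catalan numbers) with the candidate closed form C(n) = binomial(2n, n)/(n+1), are purely illustrative assumptions, not part of the question's actual resolution mechanics.

```python
from math import comb

def catalan(n: int) -> int:
    # Hypothetical candidate explanation proposed by an AI system:
    # C(n) = binomial(2n, n) / (n + 1), the Catalan numbers (A000108).
    return comb(2 * n, n) // (n + 1)

def matches_held_out(rule, shown, held_out):
    # Judge-side check: the rule must reproduce both the terms the
    # system was shown and the held-out continuation of the sequence.
    offset = len(shown)
    return (all(rule(i) == t for i, t in enumerate(shown))
            and all(rule(offset + i) == t for i, t in enumerate(held_out)))

# First terms of A000108, split into a visible prefix and a held-out tail.
shown = [1, 1, 2, 5, 14, 42]
held_out = [132, 429, 1430]
print(matches_held_out(catalan, shown, held_out))  # → True
```

In a real evaluation the judgment would of course be richer than term matching (importance, relevance, and elegance of the explanation), but held-out terms give a cheap first filter for mathematical soundness.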
This question resolves YES if the best AI systems match or outperform a baseline set by groups of professional human mathematicians working together for several hours on this task, with access to outside references.
Some more details on judging:
I will generally accept outcomes where AI systems draw most of their answers from a subset of mathematics (combinatorics, number theory, algebra, etc.), as long as the answers are judged, in consultation with the mathematical community, to be acceptably important, relevant, and elegant.
Generally speaking, I will lean towards NO if an otherwise successful system appears to lack the breadth of human mathematicians in major ways.
I am open to suggestions from commenters on ways to adjust the resolution criteria to better fit its aim.