This question aims to test whether AI systems will approach the breadth and maturity of the professional mathematical community before 2028.

Currently, the On-line Encyclopedia of Integer Sequences (OEIS) adds several dozen new sequences per day, so thousands of fresh sequences become available each year for AI evals. In mid-2027, I will create or sponsor efforts (or defer to sufficiently high-quality outside evaluations or benchmarks) to determine how well frontier AI systems can suggest reasonable and mathematically sound explanations for new OEIS sequences, given terms from the sequence and possibly hints about the domain of the intended explanation.

This question resolves YES if the best AI systems match or outperform a human baseline on this task: groups of professional mathematicians working together across several hours, with access to outside references.

Some more details on judging:

I will generally accept outcomes where AI systems draw their answers mostly from a subset of mathematics (combinatorics, number theory, algebra, etc.), as long as the answers are judged acceptably important, relevant, and elegant, in consultation with the mathematical community.

Generally speaking, I will lean towards NO if a successful system appears to lack the breadth of human mathematicians in large ways.

I am open to suggestions from commenters on ways to adjust the resolution criteria to better fit its aim.

I'd be interested in trading in this market if there were a more concrete definition of the baseline which will be considered. Right now it's not clear to me how well humans can do at inferring the logic behind integer sequences, or what would count as "clear outperformance" on this task.

It's not clear to me either how well humans would do on this task as currently formulated. Broadly speaking, for this market, I'm willing to decide on a specific test setup in consultation with traders and mathematicians in whatever way most closely measures the statement "AI systems will approach the breadth and maturity of the professional mathematical community before 2028." I will try as best as possible to avoid a situation where that statement is clearly false but this question resolves YES.

See my other comment for some suggestions on how we could potentially make the task easier for humans by giving a hint that restricts the domain of the sequence. Another suggestion would be to allow test-takers to ask questions about the sequence (not exceeding some information budget, say 5-10 bits).
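One way to make the "5-10 bits" budget concrete: k yes/no questions can distinguish at most 2^k candidate hypotheses. The sketch below is a hypothetical framing of that bound, not part of the proposed ruleset; the function names are my own.

```python
import math

# A k-bit question budget lets the test-taker distinguish at most 2**k
# candidate hypotheses via yes/no questions. These helpers just make
# that accounting explicit (hypothetical framing of the "5-10 bits"
# budget mentioned above).

def hypotheses_distinguishable(bits):
    """Maximum number of hypotheses separable with `bits` yes/no questions."""
    return 2 ** bits

def bits_to_distinguish(n_hypotheses):
    """Minimum yes/no questions needed to single out one of n hypotheses."""
    return math.ceil(math.log2(n_hypotheses))
```

So a 10-bit budget suffices to narrow down roughly a thousand candidate domains, which gives a rough sense of how much a hint of that size could help.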

Roughly, this is the procedure I imagine for the test:

A portion of the sequence is first revealed to a group of professional mathematicians working together, possibly with a hint.

They attempt to suggest an explanation for the phenomenon, and are allowed to use the Internet to check references (or even to look for the sequence itself, though efforts should be taken to select OEIS-worthy test sequences which have not been widely discussed online). If their explanation adheres to the hint and correctly generates the first N + K terms (but diverges afterwards), they are shown something like max(N, 2K) additional terms of the sequence.

This process can repeat for a limited number of rounds or until a time limit is reached (say, 6 hours per sequence), and the mathematicians are only considered successful if they eventually propose an explanation that generates the entire sequence. We could also reward answers that required seeing less of the sequence.
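The reveal-and-check loop above can be sketched as follows. This is only an illustration of the protocol as I read it: `propose` stands in for the solver (a human team or an AI), `true_terms` for the hidden test sequence, and the round and term counts are hypothetical defaults.

```python
# Hypothetical sketch of the iterative reveal protocol described above.
# `propose` takes the revealed terms and returns a candidate generating
# function (index -> term); the trial succeeds if the candidate ever
# reproduces the entire test sequence.

def run_trial(true_terms, propose, n_initial=10, max_rounds=5):
    """Reveal terms, collect a candidate explanation, and reveal more
    terms whenever the candidate diverges from the true sequence."""
    shown = n_initial
    for _ in range(max_rounds):
        candidate = propose(true_terms[:shown])
        guess = [candidate(i) for i in range(len(true_terms))]
        if guess == list(true_terms):
            return True, shown  # explanation generates the full sequence
        # Find where the candidate first diverges; if it agreed for
        # N + K terms, reveal max(N, 2K) additional terms.
        agree = next(i for i, (a, b) in enumerate(zip(guess, true_terms))
                     if a != b)
        k = max(agree - shown, 1)
        shown = min(len(true_terms), shown + max(shown, 2 * k))
    return False, shown
```

For example, a solver that correctly conjectures "the squares" from the first ten terms succeeds in one round, while a solver stuck on a wrong hypothesis burns through its rounds as more terms are revealed.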

AI systems will be tested according to a comparable procedure, using the same time limit, and their scores will be compared against humans. In determining whether AI systems match or outperform humans, I'll lean towards conservatism in what I'd accept for a YES resolution.

How well do mathematicians do at this today? If you show a mathematician just the first N terms of a random OEIS sequence, how well can they guess the procedure that generates it? If one only has to guess a procedure that generates an (unordered) set of numbers, then I am not a mathematician but I know a trivial way to do it: construct a polynomial with the numbers as roots, i.e. the product of (x - s_i) over all s_i in the set. If there's a trivial way to generate N terms of an arbitrary sequence (other than "consider the sequence defined as 1, 2, 3, 997, 4, 5, 6"), does that count? What if the AI defines a completely different sequence that shares the same N given terms?
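The "polynomial with the numbers as roots" trick the comment mentions can be made concrete: for any finite set of numbers, expanding the product of (x - s_i) yields a polynomial whose root set is exactly that set, so it trivially "explains" the set without saying anything about how the sequence continues. A minimal sketch:

```python
# Expand p(x) = (x - s_1)(x - s_2)...(x - s_n) for a finite set of
# numbers, illustrating the trivial "explanation" from the comment:
# any finite set is exactly the root set of this polynomial.

def poly_with_roots(values):
    """Return coefficients of prod(x - s) for s in values,
    lowest degree first."""
    coeffs = [1]  # start with the constant polynomial 1
    for s in values:
        # Multiply the current polynomial by (x - s).
        shifted = [0] + coeffs                    # x * p(x)
        scaled = [-s * c for c in coeffs] + [0]   # -s * p(x)
        coeffs = [a + b for a, b in zip(shifted, scaled)]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x**i for i, c in enumerate(coeffs))
```

For instance, the set {1, 2, 3} yields x^3 - 6x^2 + 11x - 6, which vanishes precisely at 1, 2, and 3. This is exactly why such degenerate answers would need to be ruled out by requiring explanations that generalize beyond the given terms.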

Thanks for your question! Some adjustments/clarifications I'll make in response to this:

First, we can restrict the eval to sufficiently long (or infinite) sequences, and only accept answers that generalize from the N given terms to the rest of the sequence. We can also increase N iteratively as humans/AIs fail for smaller values. A plausible but non-matching explanation for the first N terms will prompt more of the sequence to be revealed (2N terms, or as many additional terms as needed to disambiguate between the test sequence and the suggested explanation).

I'll note that a system which is only giving its hypotheses as polynomials clearly "lack[s] the breadth of human mathematicians in large ways," so the question would not resolve YES from such a system.

Another way to improve the task might be to add additional information about the sequence to the question, like "This sequence can be expressed as the orders of a natural sequence of groups" (or something more specific, as necessary), and only accept answers which adhere to the hint.

Broadly speaking, for this market, I'm willing to decide on a specific test setup in consultation with traders and mathematicians in whatever way most closely measures the statement "AI systems will approach the breadth and maturity of the professional mathematical community before 2028." I will try as best as possible to avoid a situation where that statement is clearly false but this question resolves YES.

It seems that the thing you really are focused on asking is whether "AI systems will approach the breadth and maturity of the professional mathematical community before 2028".

I think guessing OEIS sequences is a poor metric for that; I suggest you provisionally change the question title to something like "UNDER CONSTRUCTION" and reimburse any present traders.

The OEIS is, as far as I understand it, a thin slice of math; it is interesting and fun and at least a decent part of it deals with actual "serious" problems. For example, I bet there is a sequence there for the number of steps a given integer takes to reach 1 in the Collatz problem, or the number of known ways a given even number >2 can be written as a sum of two primes (Goldbach's conjecture). But I think a looooot of math is not really related to anything OEIS-y. So the OEIS is not very representative, and also "guess this sequence" is not representative of what mathematicians actually do.

You suggest paying mathematicians to look at given finite sequences (neither humans nor AIs can be shown an infinite sequence; if a sequence is infinite you must give a finite number of its terms). Why would you measure what professional mathematicians are able to do by telling them to stop doing their job as professional mathematicians and look at this one particular problem?

I think what you want to ask can be much better measured by AIs actually advancing mathematical knowledge. Have the AI prove new things that are currently only conjectured and hypothesized. Have it refine bounds we currently know for problems - things like the twin prime conjecture. Have it prove weaker versions of things we generally want to prove, in the hope of it being a stepping stone to the stronger proof. Have it generalize and strengthen proofs we already have. Have it find empirical data that suggests mathematical truths even if it doesn't prove them. Have it win a Fields Medal or an Abel Prize or a Millennium Prize.

I think that solving the Riemann Hypothesis is more important than guessing the majority of OEIS sequences that don't rely on unsolved problems like Collatz or Goldbach or the RH itself.