Does AI Pareto-dominate technical but non-mathematician humans at math?
Closes Jan 11 · Ṁ6,046 · 21% chance

Original title: Has AI surpassed technical but non-mathematician humans at math?

I'm making this about me, and not betting in this market.

This is like my superhuman math market but with a much lower bar. Instead of needing to solve any math problem a team of Fields medalists can solve, the AI just needs to be able to solve any math problem I personally can solve.

And I'm further operationalizing that as follows: by January 10, will any commenter be able to pose a math problem that the frontier models fail to give the right answer to but that I can solve? If so, this resolves to NO. If, as I currently suspect, no such math problems can be found, it resolves YES.

In case it helps calibrate, I have an undergrad math/CS degree and a PhD in algorithmic game theory and I do math for fun but am emphatically not a mathematician and am pretty average at it compared to my hypernerd non-mathematician friends. I think I'm a decent benchmark to use for the spirit of the question we're asking here. Hence making it about me.

FAQ

1. Which frontier models exactly?

Whatever's available on the mid-level paid plans from OpenAI, Anthropic, and Google DeepMind. Currently that's GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro.

2. What if only one frontier model gets it?

That suffices.

3. Is the AI allowed to search the web?

TBD. When posing the problems I plan to tell the AI not to search the web. I believe it's reliable in not secretly doing so, but we can either (a) talk about how to be more sure of that, or (b) decide that web search is fair game and we just need to find ungooglable problems.

4. What if the AI is super dumb but I happen to be even dumber?

I'm allowed to get hints from humans and even use AI myself. I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.

5. Does the AI have to one-shot it?

Yes, even if all it takes is an "are you sure?" to nudge the AI into giving the right answer, that doesn't count. Unless...

6. What if the AI needs a nudge that I also need?

This is implied by FAQ 4, but to spell it out: if I'm certain that I would've given the same wrong answer as the AI, then the AI needing the same nudge as me means I don't count as having bested it on that problem.

7. Does it count if I beat the AI for non-math reasons?

For example, maybe the problem involves a diagram in crayon that the AI fails to parse correctly. This would not count. The problem can include diagrams but they have to be given cleanly.

8. Can the AI use tools like writing and running code?

Yes, since we're not asking about LLMs specifically, it makes sense to count those tools as part of the AI.

9. What if AI won't answer because the problem contains racial slurs or something?

Doesn't count. That's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless. Basically, we'll translate the problem statement to a canonical form for standard technical communication.

10. Are trick questions fair game?

No, those are out. Too much randomness, both for the AI and for humans, in whether one spots the trick.

11. How about merely misleading questions?

We'll debate those case-by-case in the comments and I may update this answer with more general guidelines. In the meantime, note the spirit of the question: how good AI is at math specifically.

(I'm adding to the FAQ as more clarifying questions are asked. Keep them coming!)


[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

  • Update 2025-12-15 (PST) (AI summary of creator comment): If the AI can provide code that the creator can run locally to get the correct answer, that counts as the AI giving the correct answer. This applies even if the AI's sandboxed environment cannot run the code due to computational intensity limitations.


If you had to resolve now, which way would this go, based on the problems you've seen so far?

@AhronMaline I think the digit reduction problem from @TotalVerb is the only candidate currently. I'm still working on it. I need to first solve it, with little enough help that I'm convinced I could've gotten there with a couple days of solid work. Then I need to confirm that none of the big 3 LLMs can get there.

You have a string of digits representing a number of arbitrary length. You can perform a reduction operation where you split the string into any number of substrings and sum them (e.g., input "124" could become 12 + 4 = 16).

Demonstrate that for any initial input, it is possible to reach a single-digit result (0-9) in at most 3 steps.

There are preprints on this problem on the Internet, so you'd need to ask the models not to search the web. I found that neither GPT-5.2 nor Gemini 3 Pro could do this.

I would suggest working on the problem without LLM help initially, because certain LLM outputs are kind of nonsense and more likely to confuse than help.
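(For anyone who wants to poke at it, here's a minimal brute-force sketch of the reduction operation, assuming splits at arbitrary cut points as in the "124" example above. It enumerates all 2^(n-1) splits, so it's only usable for short strings, but it's handy for sanity-checking small cases.)

```python
from itertools import combinations

def reductions(s: str) -> set[int]:
    """All sums reachable from digit string s in one reduction step."""
    n = len(s)
    sums = set()
    for k in range(n):  # k = number of cut points between digits
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            parts = [s[i:j] for i, j in zip(bounds, bounds[1:])]
            sums.add(sum(int(p) for p in parts))
    return sums

def min_steps(s: str, limit: int = 4) -> int | None:
    """Fewest reduction steps from s to a single digit (None if > limit)."""
    frontier = {s}
    for step in range(limit + 1):
        if any(len(t) == 1 for t in frontier):
            return step
        frontier = {str(v) for t in frontier for v in reductions(t)}
    return None

print(min_steps("124"))         # 1: e.g. 1 + 2 + 4 = 7
print(min_steps("1111111111"))  # 2: ten 1s -> 10 -> 1 + 0 = 1
```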

@TotalVerb Just my human brain so far but I think I'm missing something. We start with an n-digit number? Is the maximum reduction to split it into n single digits and add those up? If the original number was a string of n 1's then the first reduction will yield the number n.

I'm thinking we can use that to construct an input that will be more than 9 after 3 iterations.

Let a = 1111111111 (10 1's).

Let b be a string of a 1's.

Let c be a string of b 1's.

Now c can at best be reduced to b (step 1), which can at best be reduced to a (step 2), which can at best be reduced to 10 (step 3).

Is that a valid counterexample? (I'm going to go ahead and submit this comment now before asking my robot friends, who I half-expect to set me straight on what hidden assumption I snuck in or something.)

@dreev it is not a valid counterexample. I'm not sure how much more I can say without it being a large hint.

@TotalVerb I guess one statement that is useful but doesn't give away the crux of the problem is: your thought process is exactly right, so indeed a "greedy" algorithm does not suffice here.

@TotalVerb Oh, duh, you can pick partitions that yield lots of zeros. Ok, back to the drawing board!
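(To illustrate the zeros idea with a small example of my own: the string "19" reduces as 1 + 9 = 10, and then 1 + 0 = 1. Splits whose sums land on zero-heavy numbers collapse much faster than the greedy all-singles split would suggest.)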

Is a question that asks both of us to do something, where whoever does it better wins, a valid type of question, so long as it's math-related? I was able to beat the free version of ChatGPT on a math question using knowledge I've obtained from YouTube rabbit holes, and I believe there are probably other similar questions that could work if this one doesn't.

@Zeolite Potentially! I'd love to hear the example.

@dreev My question was "Create the largest value you can using any 10 characters."

ChatGPT went with 9^^^^^^^^9 initially. I pointed out that that was easily computable (speaking strictly theoretically). That probably needed to be specified to make it a well-defined problem. Allowing noncomputable numbers, ChatGPT switched to Rayo(10^9). Then we had a lovely little discussion about whether Rayo(9^^9) is meaningfully bigger than that (technically yes) and whether something like Rayo^9() could help (maybe not?).

Any ideas for turning that into a math problem with a right answer that I'd have hope of winning at? So far it was a pretty equal collaboration between me and the AI to come up with Rayo(9^^9).

@dreev I think that's the hard part there. The problem not really having an easily provable correct answer is part of what makes the AIs not as good as they could be at it. I'm not 100% sure, but I'm guessing it could work if you were to frame the question in a way that limits the scope enough to have a definitive answer. I see where you're coming from, though. Rayo() is something I was vaguely aware of but didn't come up with initially, as it almost feels like cheating for a problem like this. The interesting thing to me was that even among computable numbers (possible to compute, even if not feasible), GPT's first guess (similar to what I saw when testing) is beatable in many different ways. I'll think some more about whether there's a better way to phrase the question.
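(An aside on the notation above, since it's doing the heavy lifting: 9^^^^^^^^9 is Knuth's up-arrow notation, where each additional arrow iterates the operation below it. A minimal sketch, only sane for tiny arguments:)

```python
def up(a: int, arrows: int, b: int) -> int:
    """Knuth's up-arrow: one arrow is a**b; n arrows iterate n-1 arrows."""
    if arrows == 1:
        return a ** b
    if b == 0:
        return 1
    return up(a, arrows - 1, up(a, arrows, b - 1))

print(up(3, 1, 3))  # 3^3 = 27
print(up(3, 2, 3))  # 3^^3 = 3^(3^3) = 7625597484987
# up(3, 3, 3) is already a power tower of 7,625,597,484,987 threes -- don't run it.
```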

Do the questions have to test for mathematical reasoning ability, or can they test ability to do mathematical internet research?

@phenomist I think mathematical reasoning is the spirit of the question. But I'm not sure it matters; isn't ChatGPT pretty superhuman at mathematical internet research? If you have a counterexample I'm curious to see it regardless.

@dreev Admittedly I don't have access to current frontier models but at least free tier models weren't able to help. Also interested to see if frontier AI models do get this. (As a verification, I (as a technical person but non-mathematician), was able to research my answer later. But you can try too!):

Please draw out all octominoes that do not tile the plane.

@phenomist GPT-5.2-Thinking seems to have no trouble with this. It comes up with these 20 supposedly non-tiling octominoes:

[image of the 20 non-tiling octominoes]

And it also found 6 easy ones, namely the ones with holes. I verified the 6 easy ones but not the 20 hard ones, other than Wikipedia agreeing that 20+6 is the right number of them.

Gemini 3 Pro with web search turned off writes a bunch of code and runs it, but it times out after 10 minutes. Claude Opus 4.5 seems to do the web research correctly but can't manage to directly show me the 26 non-tiling octominoes.

@dreev Yep, that looks like the right set of octominoes. (I found the web search slightly tricky because the one source I found that did have the 20 non-holed octominoes did some additional stuff with them, so interpreting the document isn't trivial. But maybe the AI can just code it too.)
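(On the "maybe it can just code it" point: the enumeration half really is codeable. Here's a rough sketch of mine, using the standard grow-and-canonicalize approach for free polyominoes plus a flood fill for holes; it reproduces the 369 free octominoes and the 6 holed ones, but it doesn't attempt the genuinely hard part, deciding which hole-free octominoes tile the plane.)

```python
def normalize(cells):
    """Translate a set of (x, y) cells so min x and min y are both 0."""
    mx = min(x for x, _ in cells)
    my = min(y for _, y in cells)
    return frozenset((x - mx, y - my) for x, y in cells)

def canonical(cells):
    """Canonical representative over the 8 rotations/reflections."""
    variants, cur = [], cells
    for _ in range(4):
        cur = frozenset((y, -x) for x, y in cur)               # rotate 90 degrees
        variants.append(normalize(cur))
        variants.append(normalize({(-x, y) for x, y in cur}))  # mirror
    return min(variants, key=sorted)

def free_polyominoes(n):
    """All free polyominoes of n cells, grown one cell at a time."""
    shapes = {canonical(frozenset([(0, 0)]))}
    for _ in range(n - 1):
        grown = set()
        for s in shapes:
            for x, y in s:
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    c = (x + dx, y + dy)
                    if c not in s:
                        grown.add(canonical(s | {c}))
        shapes = grown
    return shapes

def has_hole(s):
    """Flood-fill the complement from outside; unreached empties are holes."""
    xs = [x for x, _ in s]; ys = [y for _, y in s]
    x0, x1 = min(xs) - 1, max(xs) + 1
    y0, y1 = min(ys) - 1, max(ys) + 1
    stack, seen = [(x0, y0)], {(x0, y0)}
    while stack:
        x, y = stack.pop()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            c = (x + dx, y + dy)
            if (x0 <= c[0] <= x1 and y0 <= c[1] <= y1
                    and c not in s and c not in seen):
                seen.add(c)
                stack.append(c)
    empties = {(x, y) for x in range(x0, x1 + 1)
               for y in range(y0, y1 + 1)} - set(s)
    return empties != seen

octos = free_polyominoes(8)
print(len(octos))                       # 369 free octominoes
print(sum(has_hole(s) for s in octos))  # 6 have holes (can't tile)
```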

Okay try this:

“Here's a passage from my lecture notes on open systems:

"The flux term $\mathcal{J}_N$ consists of gain terms (particles entering from the reservoir) and loss terms (particles leaving to the reservoir):

$$\begin{aligned} \mathcal{J}_N(\mathbf{x}_N; t) &= \int W^+(\mathbf{x}_{N-1} \to \mathbf{x}_N) \, \rho_{N-1}(\mathbf{x}_{N-1}; t) \, \mathrm{d}\omega_{N-1} \\ &\quad - \rho_N(\mathbf{x}_N; t) \int W^-(\mathbf{x}_N \to \mathbf{x}_{N-1}) \, \mathrm{d}\omega_{N-1} \\ &\quad + \int W^-(\mathbf{x}_{N+1} \to \mathbf{x}_N) \, \rho_{N+1}(\mathbf{x}_{N+1}; t) \, \mathrm{d}\omega_{N+1} \\ &\quad - \rho_N(\mathbf{x}_N; t) \int W^+(\mathbf{x}_N \to \mathbf{x}_{N+1}) \, \mathrm{d}\omega_{N+1} \end{aligned}$$

where $W^+$ and $W^-$ are the rate kernels for particle addition and removal."

This 4-term equation is a bit unwieldy. How can I naturally decompose or group these terms to make the structure clearer for students?”

___________________________________________

And then after the model has given a response follow up with:

“The real test here was to see if you would notice a subtle inconsistency in the original statement to begin with. Can you spot it now?”

__________________________________________

These are not actually lecture notes of mine, but I noticed that all 3 models failed to catch a conceptual error in a document, so this is a minimal reproduction of that failure.

@SorenJ Is the framing of the question mixing up gain/loss as in particles leaving/re-entering the reservoir vs gain/loss as in, um, something about the probability distribution over particle locations?

@dreev

Yeah, the gain/loss is actually about "probability gain/loss."

You can gain probability of having N particles by:

1) Being in state N+1 and having a particle leave to the reservoir

2) Being in state N-1 and having a particle enter from the reservoir

But the original statement incorrectly says that the “gain” term represents just particles entering the reservoir

@SorenJ But it does say "entering from the reservoir"... (I'm worried I'm failing this just as badly as the LLMs!)

@dreev That’s the point, the parenthetical remark says “particles entering from the reservoir” but that is not actually what the equation is doing.

@SorenJ Is it a matter of mixing up "entering from the reservoir" vs "entering the reservoir"? Or a deeper mixup?

@dreev You have a system and a reservoir. It is a matter of mixing up “particles entering the system” with “the probability the system has N particles increases”

(You describe the overall state of your knowledge as a probability distribution that the system has 1,2,…N-1,N,N+1,… particles)
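(A toy version might make this concrete; this is my own sketch of a standard birth-death master equation, not SorenJ's notes. Write $p_N$ for the probability that the system holds $N$ particles, $w^+_N$ for the rate of a particle entering the system from the reservoir, and $w^-_N$ for the rate of one leaving to it. Then

$$\partial_t p_N = \underbrace{w^+_{N-1}\, p_{N-1} + w^-_{N+1}\, p_{N+1}}_{\text{gain of probability for sector } N} - \underbrace{\left(w^+_N + w^-_N\right) p_N}_{\text{loss of probability for sector } N}$$

and the gain bracket mixes both directions of particle traffic, which is exactly why glossing it as "particles entering from the reservoir" is inconsistent.)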

GPT-5.2 just failed this for me:

“Consider a 3x3 diagonal matrix. The trace is the (signed) length of a single path along the edges of the box, from the origin to the furthest tip on the cube, correct?”

Edit: Gemini gets it correct though

@SorenJ Yeah, from talking to Gemini I'm guessing the box being referred to is the one with one corner at the origin and with x/y/z dimensions given by the rows of the matrix. In which case I would've said yes, the trace gives the distance along edges from corner to opposite corner. But then I talked to ChatGPT, which pointed out the ambiguity of "(signed) length". In one interpretation the answer should be no, you need to sum the absolute values, not just take the trace of the matrix. In another interpretation, we're taking a line integral and the answer is yes.

Is one of those what you had in mind? I think that either way, sadly, my human brain had nothing to contribute in getting there.
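(To make the ambiguity concrete with an example of my own: take $D = \mathrm{diag}(1, -2, 3)$. Then $\operatorname{tr} D = 1 - 2 + 3 = 2$, while the edge path across the box has ordinary length $|1| + |-2| + |3| = 6$. The two agree exactly when all diagonal entries are non-negative, and reading "(signed) length" as a signed line integral is what makes the "yes" answer defensible.)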

@dreev ChatGPT does this overly nuanced thing where it defensively hedges and makes distinctions without a difference. This is not an egregious example, but the term "signed length" already gets rid of the ambiguity.

Anyway, this one doesn't count as a failure. I will try to look for others.
