MANIFOLD
Does AI Pareto-dominate technical but non-mathematician humans at math?
Jan 30

Original title: Has AI surpassed technical but non-mathematician humans at math?

I'm making this about me, and not betting in this market.

This is like my superhuman math market but with a much lower bar. Instead of needing to solve any math problem a team of Fields medalists can solve, the AI just needs to be able to solve any math problem I personally can solve.

And I'm further operationalizing that as follows: by January 10, will any commenter be able to pose a math problem that the frontier models fail to give the right answer to but that I can solve? If so, this resolves NO. If, as I currently suspect, no such math problem can be found, it resolves YES.

In case it helps calibrate, I have an undergrad math/CS degree and a PhD in algorithmic game theory and I do math for fun but am emphatically not a mathematician and am pretty average at it compared to my hypernerd non-mathematician friends. I think I'm a decent benchmark to use for the spirit of the question we're asking here. Hence making it about me.

FAQ

1. Which frontier models exactly?

Whatever's available on the mid-level paid plans from OpenAI, Anthropic, and Google DeepMind. Currently that's GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro.

2. What if only one frontier model gets it?

That suffices.

3. Is the AI allowed to search the web?

TBD. When posing the problems I plan to tell the AI not to search the web. I believe it's reliable in not secretly doing so, but we can either (a) talk about how to be more sure of that, or (b) decide that web search is fair game and we just need to find ungooglable problems.

4. What if the AI is super dumb but I happen to be even dumber?

I'm allowed to get hints from humans and even use AI myself. I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.

5. Does the AI have to one-shot it?

Yes, even if all it takes is an "are you sure?" to nudge the AI into giving the right answer, that doesn't count. Unless...

6. What if the AI needs a nudge that I also need?

This is implied by FAQ 4, but if I'm certain that I would've given the same wrong answer as the AI, then the AI needing the same nudge as me means I don't count as having bested it on that problem.

7. Does it count if I beat the AI for non-math reasons?

For example, maybe the problem involves a diagram in crayon that the AI fails to parse correctly. This would not count. The problem can include diagrams but they have to be given cleanly.

8. Can the AI use tools like writing and running code?

Yes, since we're not asking about LLMs specifically, it makes sense to count those tools as part of the AI.

9. What if AI won't answer because the problem contains racial slurs or something?

Doesn't count. That's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless. Basically, we'll translate the problem statement to a canonical form for standard technical communication.

10. Are trick questions fair game?

No, those are out. Too much randomness, both for the AI and for humans, in whether one spots the trick.

11. How about merely misleading questions?

We'll debate those case-by-case in the comments and I may update this answer with more general guidelines. In the meantime, note the spirit of the question: how good AI is at math specifically.

(I'm adding to the FAQ as more clarifying questions are asked. Keep them coming!)


[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

  • Update 2025-12-15 (PST) (AI summary of creator comment): If the AI can provide code that the creator can run locally to get the correct answer, that counts as the AI giving the correct answer. This applies even if the AI's sandboxed environment cannot run the code due to computational intensity limitations.

  • Update 2026-01-10 (PST) (AI summary of creator comment): For a problem to count as one the AI fails at, the AI must give the wrong answer more than 50% of the time when asked in fresh sessions. If the AI can get it right more than 50% of the time (even if less reliably than the creator), it does not count as a failure.

  • Update 2026-01-10 (PST) (AI summary of creator comment): The creator is considering a resolve-to-PROB approach for cases where the AI can solve problems the creator can reliably solve, but not consistently:

    • If the only problems where the creator beats the AI are ones the AI can also solve (just less reliably), this creates ambiguity

    • Proposed resolution method: Take the hardest problem the creator can reliably solve. If the best frontier model can solve it X% of the time, resolve to X%

    • Example: If the best model solves it 50% of the time, resolve to 50%

This is still under discussion and not yet finalized.

  • Update 2026-01-23 (PST) (AI summary of creator comment): The creator is currently agonizing about the fairest resolution for this market based on the math problem posed by @TotalVerb.

Key considerations affecting resolution:

  • The creator is uncertain whether they would have solved the problem independently within 2 full days of work

  • In reality, the creator felt stuck and reached the solution only via extensive discussion with GPT-5.2, with the AI providing what felt like "all the actual good ideas and mathematical insight"

  • The creator is conducting experiments with GPT-5.2 in temporary chats (without web search). GPT-5.2 might get there with strictly content-free nudges like "keep going"

  • If content-free nudges suffice for the AI, the creator believes this should count as either a success for the AI or a failure for the human (the creator)

  • The creator considers it "at best extremely ambiguous" how much math-problem-solving value their human brain added

The creator is seeking @TotalVerb's verdict on an AI-only proof to help determine resolution.

  • Update 2026-01-23 (PST) (AI summary of creator comment): The creator is considering resolving to 25% PROB based on:

    • 50% confidence that they meaningfully contributed to the proof (vs GPT-5.2 being first author)

    • 50% confidence that they could have solved it independently with 2 full days of work

    • Combined: 50% × 50% = 25%

This is still under consideration and not finalized. The creator is soliciting trader input on the fairest resolution approach.

Important: Only problems submitted by the January 10 deadline count for resolution purposes.


@traders I'm keeping trading open while we figure out the fairest resolution (do chime in!) but only problems submitted by the Jan 10 deadline count. To be totally fair, successes from the AI after that date also shouldn't count since in theory the AI could've gotten smarter. In practice I don't think that's much of a concern in this kind of time frame so I'm continuing to experiment with the AIs.

The big question mark is this digit sum reduction problem from @TotalVerb. I'd say GPT-5.2 deserves to be first author on the proof we came up with but by the harsher criteria in the market description, well, it's super ambiguous.

Maybe I mostly need to make the judgment call on if I could've gotten there on my own with 2 full days of work. 🤔 😅

I guess right now I'm kind of at 50% for whether I meaningfully contributed to the proof and also at 50% for whether I could've gotten there on my own. Maybe that's an argument for resolve-to-PROB at 25%??

There's also still ambiguity on whether GPT-5.2-Thinking actually gets there on its own. See writeups at doc.dreev.es/slicesum.

@dreev the original resolution criteria say:

I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.

The phrasing isn't very clear on whether the two conditions both need to be satisfied for a "human victory", or whether one is enough. I think it makes more sense to interpret it as needing both: if you only "meaningfully contributed" but couldn't do it on your own, why would that be a human victory?

If so, and your best judgement yields (independent?!) credences of 50% on both questions, that would mean the final probability is 75%, not 25%.

@dreev did you try the Turing completeness one? I think you could get there with some work, or at least get further than the AI (for me the AI didn't really get anywhere).

Current AIs are able to solve most narrow, bite-sized questions you can pose them, but they struggle with tasks that are open-ended and/or require one to combine a long chain of insights to reach the solution. I think the question I posed below is in the latter category, because Turing completeness is a nebulous concept, and simulating a Turing machine from an odd set of tools requires a lengthy engineering process. So I don't think you'll be able to get an AI to solve it, no matter how much time you give it. A person like you would probably be able to, but it depends how long you're willing to work on it. So while I think the era of getting AI to fumble on those bagel-splitting or sandwich-stacking problems is over, it would be hasty to conclude that AI dominates humans at math in general. I think humans are still much better at general mathematical thinking; but it can't be quickly demonstrated with these canned problems anymore.

There is an infinite square grid. The dimensions of each square are 1 by 1. Centered on each vertex of each square is an equilateral triangle pointing up with a circumradius of 0.5. You can pick any square, and from the center of that square, you can shoot a laser bullet at a rational angle. When the laser hits a triangle, it bounces at a normal reflection, and the triangle then rotates 60 degrees clockwise (after the bullet passes away, so the triangle doesn't hit it while rotating). Before you make the shot, you can manually rotate a finite number of triangles by 60 degrees clockwise. Is this system that I've described Turing complete? Explain why or why not. If it is, then give a specific example of a configuration which simulates a calculation of 11 * 37.

I don't know the answer to this question because I haven't thought about it, but I gave it to Gemini 3 Pro and it claimed that it was Turing complete (and gave a handwavy explanation for why) but it refused to do the 11 x 37 part because it said it's too complicated.

@ItsMe Sure: Rotate 37 distinct triangles, then do this 11 more times. Then count the total number of rotated triangles.

(You may want to specify a particular way the output needs to be read.)

@ItsMe I'll try harder if you think that would be fruitful but I tentatively don't think I can improve on Gemini's answer there. (Ha, nice hack from Isaac King there.)

By the way you can replace your

[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

with

There will be no AI clarifications added to this market's description.

And they won't happen at all.

To confirm: If I can give a problem that you can very reliably get right yourself (i.e. I'd be willing to wager at 20:1 that you'll get the right answer before telling you the problem), while frontier AIs can get it right but only, say, 10% of the time, that is not sufficient to resolve this NO?

@IsaacKing Good question. I'll do my best to give a fair assessment of whether the frontier AIs give the right answer but we need to add an FAQ item here for how reliable the AI needs to be in giving the right answer. Shall we say >50% of the time when asked in a fresh session?

@dreev how reliable are you in giving the right answer when asked in a fresh session?

Isaac's predicting I'll be pretty reliable for the question he has in mind. Are you arguing for a different threshold than 50% for the AI? If the only math problem we can find where I'm better than the AI is one where the AI also can solve it, just not reliably, that's pretty ambiguous in terms of the spirit of the question.

Maybe we actually want resolve-to-PROB in that case? Of the problems I can reliably solve, take the hardest one for AI. If the best of the frontier models can solve it 50% of the time, that's a resolve-to-50%.

@dreev dunno, I didn't mean to argue for any rule in particular. Just emphasizing that "reliability" is hard to interpret as part of "machines vs humans", since we don't do the "fresh session" thing.


If you had to resolve now, which way would this go? Based on problems you've seen so far

@AhronMaline I think the digit reduction problem from @TotalVerb is the only candidate currently. I'm still working on it. I need to first solve it, with little enough help that I'm convinced I could've gotten there with a couple days of solid work. Then I need to confirm that none of the big 3 LLMs can get there.


Have you tried the bagel slicing question?

Or similar variants, like slicing a jack

@ItsMe That's a good one. (What's the maximum number of connected pieces you can slice a bagel into using a single planar slice?) It turns out to not be hard to confuse even very smart non-mathematician humans on that one. If you hallucinate a complicated-sounding reason that it's possible to get 3 pieces, people believe you. But the initially obvious answer of 2 does turn out to be correct. This is easy for LLMs with web search. Without it, GPT-5.2 currently gets it wrong in my testing, but Gemini (where I can explicitly turn off web search) is not fooled.

Gemini comes up with 6 pieces for a normal 4-tined fork (that's secretly made of cake?), reasoning like so:

  1. The Geometry (The "Bridge"): A normal silverware fork is not perfectly flat; it has an "S" curve. If you place it tines-down on a table, it forms a bridge shape. The points touching the table are the 4 tips of the tines and the end of the handle. The neck (where the head meets the handle) arches up in the air.

  2. The Cut: Imagine the fork is resting on a flat surface (tines down). If you make a planar slice parallel to the surface, just a few millimeters above it (slicing off the "feet" of the bridge), you will cut through the contact points.

  3. The Resulting Pieces:

    • Pieces 1, 2, 3, 4: The four severed tips of the tines. These are now isolated bits of cake.

    • Piece 5: The severed tip of the handle. This is also an isolated bit of cake.

    • Piece 6: The main body of the fork (the "arch" consisting of the remaining tines, the neck, and the handle), which remains one solid, connected piece.

If that's wrong, I'm failing to see why. (It also notes that if you wanted to simplify to a flat fork then the best you can do is slicing off the 4 tines for 5 total pieces.)

PS: I think you just edited your last comment from "fork" to "jack"? Gemini says 4 and that's sounding right to me. I struggle with this kind of visualization and have the sense that Gemini is genuinely better at it than me.

@dreev If you have an S-shape

You can cut that into 4 pieces. One of those pieces is the 4 tines, so you have 4+3=7 total pieces.

@SorenJ even better: if the tines themselves are curved, then you can have the plane cut through each tine twice. That makes 9 pieces

@SorenJ Believable. I'm staring at one of the forks in our kitchen and the curvature seems subtle enough and the fork thick enough that I can't tell if even 7 pieces would actually be possible. More to the point, all indications are that Gemini still has a better handle, so to speak, on these questions than I do.

You have a string of digits representing a number of arbitrary length. You can perform a reduction operation where you split the string into any number of substrings and sum them (e.g., input "124" could become 12 + 4 = 16).

Demonstrate that for any initial input, it is possible to reach a single-digit result (0-9) in at most 3 steps.

There are preprints on this problem on the Internet, so you'd need to ask them not to search the web. I found that neither GPT-5.2 nor Gemini 3 Pro could do this.

I would suggest working on the problem without LLM help initially, because certain LLM outputs are kind of nonsense and more likely to confuse than help.
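As a sanity check on the rules of the problem (an editor's sketch, not from the thread), here's a brute-force Python snippet that enumerates every possible reduction of a digit string and finds the fewest steps to a single digit. The exponential enumeration only scales to short strings, so it can't probe the adversarially long inputs discussed later in the thread; it just pins down the mechanics.

```python
def reductions(s):
    """All values reachable from digit string s in one reduction:
    choose any subset of the n-1 gaps as cut points, then sum the parts.
    (The empty subset means 'don't split', giving the number itself.)"""
    n = len(s)
    out = set()
    for mask in range(1 << (n - 1)):
        total, start = 0, 0
        for i in range(n - 1):
            if mask >> i & 1:
                total += int(s[start:i + 1])
                start = i + 1
        total += int(s[start:])
        out.add(total)
    return out

def steps_to_single_digit(s, limit=3):
    """Fewest reductions (up to limit) needed to reach a value below 10,
    or None if no sequence of at most `limit` reductions gets there."""
    frontier = {s}
    for step in range(1, limit + 1):
        nxt = set()
        for t in frontier:
            for v in reductions(t):
                if v < 10:
                    return step
                nxt.add(str(v))
        frontier = nxt
    return None

# e.g. "124" can become 12 + 4 = 16, or 1 + 2 + 4 = 7 in one step
assert steps_to_single_digit("124") == 1
```

For every integer short enough to enumerate, three steps always suffice, consistent with the claim; the interesting question is whether that holds for arbitrarily long strings.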

@TotalVerb Just my human brain so far but I think I'm missing something. We start with an n-digit number? Is the maximum reduction to split it into n single digits and add those up? If the original number was a string of n 1's then the first reduction will yield the number n.

I'm thinking we can use that to construct an input that will be more than 9 after 3 iterations.

Let a = 1111111111 (10 1's).

Let b be a string of a 1's.

Let c be a string of b 1's.

Now c can at best be reduced to b (step 1), which can at best be reduced to a (step 2), which can at best be reduced to 10 (step 3).

Is that a valid counterexample? (I'm going to go ahead and submit this comment now before asking my robot friends, who I half-expect to set me straight on what hidden assumption I snuck in or something.)

@dreev it is not a valid counterexample. I'm not sure how much more I can say without it being a large hint.

@TotalVerb I guess one statement that is useful but doesn't give away the crux of the problem is: your thought process is exactly right, so indeed a "greedy" algorithm does not suffice here.

@TotalVerb Oh, duh, you can pick partitions that yield lots of zeros. Ok, back to the drawing board!
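To make the zeros observation concrete (an editor's illustration, not from the thread): a single well-chosen split can collapse the digit sum dramatically, which is exactly why the greedy lower-bound argument above fails.

```python
# One split can manufacture zeros: "19999" = 1 + 9999 -> 10000,
# whose digit sum is 1, even though the all-singles split gives 37.
def one_split_sum(s, cut):
    """Sum of the two parts of digit string s when cut at position `cut`."""
    return int(s[:cut]) + int(s[cut:])

print(one_split_sum("19999", 1))  # 1 + 9999 = 10000
```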

@TotalVerb This is not an easy problem, at least for me!

@TotalVerb Does the proof involve finding a way to have the first reduction yield a number whose digit-sum is at most 18?

@dreev maybe that could be possible, but the proof I have in mind doesn't have that exact property. It does, however, involve finding a way to have the first reduction yield a number with a very small number of non-zero digits (which almost always has a convenient property similar to your assertion, but might need some minor casework for exceptions).

@TotalVerb How's this:

First, call the result of any single reduction a redsum. A standard digit-sum is the smallest possible redsum.

  1. Let S be the digit-sum of the input.

  2. If S<100 then 3 digit-sum reductions suffice: N → (at most 99) → (at most 18) → (at most 9). So assume from here on that S >= 100.

  3. Consider the list of digits in N. The baseline redsum is the sum of those individual digits. By merging 2 adjacent digits, a and b, the redsum increases by (10a+b)-(a+b) = 9a <= 81.

  4. Consider a string of digits like abcde. You can pair them like (ab)(cd)e and increase the redsum by 9a+9c. Or you can pair them like a(bc)(de) and increase the redsum by 9b+9d. The final digit, e, is the odd one out if there's an odd number of digits. Set e aside. The remaining sum of the digits is S-e. We now have a choice between two equal-sized sets forming a bipartition of the digits -- {a, c} vs {b, d}. More generally, the even-indexed digits vs the odd-indexed digits, still ignoring the odd digit out if there is one. Whichever of those has the bigger sum, we can choose it and increase the redsum by that, times 9. At worst those sets have equal sums, each summing to half of S-e so either one we pick increases redsum by 9(S-e)/2. Since e is at most 9, we can increase the redsum by at least 9(S-9)/2. (The case of an even number of digits is the same as e=0 so the bound is unchanged.)

  5. Let U = S + 9(S-9)/2 = (11S-81)/2 = 5.5S-40.5. By step 4 we know we can hit a redsum at least that big just by pairing adjacent digits.

  6. In fact, by doing those pairings one at a time in order from left to right, we get a sequence of attainable redsums. Recall from step 3 that each redsum in that sequence is at most 81 more than the previous one.

  7. Let p be the exponent of the greatest power of 10 at or below S. By step 2, p is at least 2 because 10^2 = 100. E.g., if S=12345 then p = 4 because 10^4 = 10000.

  8. Let T = 2×10^p if S < 2×10^p and 10^(p+1) otherwise. This is our target, a number starting with a 1 or a 2 followed by all zeros. Note that S <= T.

  9. If 10^p <= S < 2×10^p then 2×10^p <= 2S < 4×10^p. Since T = 2×10^p in this case, the first part of that inequality can be written T <= 2S. And since 2S < 5.5S-40.5, we have that T < U.

  10. If 2×10^p <= S < 10^(p+1) = 10×10^p then T = 10×10^p. Given that S is at least 2×10^p, U is at least 5.5×2×10^p-40.5 = 11×10^p-40.5 which, when p>=2, is more than T. So either way, S <= T < U.

  11. Finally, by step 6, we can find a redsum within 81 above T. At worst, we have a 2 followed by ~p zeros, then a 7 and a 9. That's a digit-sum of at most 18.

  12. So that's one reduction to make a number with almost all zeros, one more (a digit-sum) to get at most 18, and a third to turn that into at most 9.
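The construction in steps 1-12 can be checked numerically. Below is an editor's Python sketch (not part of the thread): it walks the left-to-right pairing sequence from step 6 for both pairing parities, looks for an attainable redsum landing in [T, T+81], and confirms the follow-up digit sums collapse as the proof claims.

```python
def pairing_redsums(digits, start):
    """Redsums from merging adjacent pairs (start, start+1), (start+2, start+3), ...
    one at a time, left to right. Merging digits (a, b) into the two-digit
    number 10a+b adds (10a+b) - (a+b) = 9a to the sum (step 3)."""
    cur = sum(digits)
    seq = [cur]
    i = start
    while i + 1 < len(digits):
        cur += 9 * digits[i]
        seq.append(cur)
        i += 2
    return seq

def digit_sum(n):
    return sum(int(c) for c in str(n))

def three_step_check(s):
    """Verify the proof's 3-step plan for digit string s."""
    digits = [int(c) for c in s]
    S = sum(digits)
    if S < 100:
        # Step 2 of the proof: N -> (<100) -> (<=18) -> (<=9).
        return digit_sum(digit_sum(S)) <= 9
    p = len(str(S)) - 1                      # greatest power of 10 at or below S
    T = 2 * 10**p if S < 2 * 10**p else 10**(p + 1)
    # One of the two pairing parities crosses T in increments of at most 81
    # (steps 4 and 6), so some attainable redsum lands in [T, T+81].
    candidates = pairing_redsums(digits, 0) + pairing_redsums(digits, 1)
    hits = [v for v in candidates if T <= v <= T + 81]
    if not hits:
        return False
    # The hit is a 1 or 2 followed by zeros plus a small remainder, so its
    # digit sum is at most 18 (step 11); one more digit sum finishes (step 12).
    return digit_sum(hits[0]) <= 18

assert three_step_check("9" * 12)    # S = 108, crosses T = 200 at redsum 270
assert three_step_check("1" * 120)   # S = 120, crosses T = 200 at redsum 201
```

Random testing over long digit strings hasn't turned up a counterexample for me, which matches the bounds derived in steps 9-10.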

@dreev oh I think that works! nice 👏 I'm curious what your experience with the AIs is. For me, GPT 5.2 seems on the right track in the thought trace but its proof is not close to being right. Gemini is on the wrong track completely.

Btw, the difference with my solution is that I didn't bother being as careful with the first digit, but in exchange I need to deal with the 3-digit redsum case with digit-sum 19 (20-27 are fine), which is always 1+99 or 91+9 or 99+1, so it happens to be OK. (Actually, I think the zeroes make this not work, so you do need to avoid 9 as the first digit; your solution is more correct.)

@TotalVerb Thanks! I kept spotting problems in my ("my") proof and ended up with this version: https://doc.dreev.es/slicesum (also I changed the term redsum to slicesum).

Now I'm agonizing about the fairest resolution of this market. I really can't decide if I would've come up with the above with 2 full days of work on my own. In reality I felt stuck and got there only via extensive discussion with GPT-5.2 and it sure felt like all the actual good ideas and mathematical insight were from the AI, not me. I'm now trying various experiments with GPT-5.2 in a temporary chat, asking it not to search the web. It does find a published result when it searches the web. I've also concluded that Claude is pretty hopeless on this problem. Gemini is less hopeless but needs at least hand-holding, it seems, like "but what if the digit string is chosen adversarially?". GPT-5.2 might get there with strictly content-free nudges like "keep going", if that counts.

I guess I think it should count, either as a success for the AI or a failure for me, especially given how much help I ended up needing myself. It's at best extremely ambiguous how much math-problem-solving value my puny human brain is adding here.

@TotalVerb, maybe you could look at the AI-only proof from GPT-5.2 at the bottom of doc.dreev.es/slicesum and give your own verdict?

@dreev the fully AI-generated proof doesn't make any sense to me. The key lemma seems to claim that numbers with digital root 3...9 can be reduced to two-digit numbers in a single reduction. But this is clearly false; the digit sum is the smallest reduction, so take any number with digit sum > 99, e.g. 1...1 with 102 1s (which has digital root 3).

The crux of the problem seems to be this claim:

Therefore, by choosing merges appropriately, you can increase the “all digits separate” sum

S by any combination of multiples of 9 coming from leading digits of merged pairs. In particular, you can realize an output congruent to r mod 9 but lying in the interval [r, r+9]

While you can certainly increase S by combinations of multiples of 9 by pairing digits (which I agree is an important observation in this problem), you certainly cannot decrease it. So if S>r+9 then this proof doesn't really make sense. So I would argue the only useful observation here is that you can pair digits, but the AI hasn't really given an acceptable algorithm to do that pairing in a productive way. Also, as an aside, I don't see why the r=1 and r=2 cases are special.

@TotalVerb I do agree with you that the resolution is not clear, though. I think the two important "aha"s to solving the problem are: 1) you need a number with a lot of zeros, 2) you can change the number by multiples of 9 by pairing together digits. I think this particular pure AI proof has only found out (2) and not (1), and has jumbled it up with some false observations in a way that kind of feels like a student running out of time on the exam and trying to put something on the page, 😂. I think if you feel like the AIs can't figure out (1) on their own with prompting but you would have, that would be an argument for No.
