Does AI Pareto-dominate technical but non-mathematician humans at math?
Jan 11 · 28% chance

Original title: Has AI surpassed technical but non-mathematician humans at math?

I'm making this about me, and not betting in this market.

This is like my superhuman math market but with a much lower bar. Instead of needing to solve any math problem a team of Fields medalists can solve, the AI just needs to be able to solve any math problem I personally can solve.

And I'm further operationalizing that as follows: by January 10, will any commenter be able to pose a math problem that the frontier models fail to give the right answer to but that I can solve? If so, this resolves NO. If, as I currently suspect, no such math problem can be found, it resolves YES.

In case it helps calibrate: I have an undergrad math/CS degree and a PhD in algorithmic game theory; I do math for fun but am emphatically not a mathematician, and I'm pretty average at it compared to my hypernerd non-mathematician friends. I think I'm a decent benchmark for the spirit of the question we're asking here. Hence making it about me.

FAQ

1. Which frontier models exactly?

Whatever's available on the mid-level paid plans from OpenAI, Anthropic, and Google DeepMind. Currently that's GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro.

2. What if only one frontier model gets it?

That suffices.

3. Is the AI allowed to search the web?

TBD. When posing the problems I plan to tell the AI not to search the web. I believe it's reliable in not secretly doing so, but we can either (a) talk about how to be more sure of that or (b) decide that web search is fair game and we just need to find ungooglable problems.

4. What if the AI is super dumb but I happen to be even dumber?

I'm allowed to get hints from humans and even use AI myself. I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as a human victory if I get there but the AIs didn't.

5. Does the AI have to one-shot it?

Yes, even if all it takes is an "are you sure?" to nudge the AI into giving the right answer, that doesn't count. Unless...

6. What if the AI needs a nudge that I also need?

This is implied by FAQ 4, but if I'm certain that I would've given the same wrong answer as the AI, then the AI needing the same nudge as me means I don't count as having bested it on that problem.

7. Does it count if I beat the AI for non-math reasons?

For example, maybe the problem involves a diagram in crayon that the AI fails to parse correctly. This would not count. The problem can include diagrams but they have to be given cleanly.

8. Can the AI use tools like writing and running code?

Yes, since we're not asking about LLMs specifically, it makes sense to count those tools as part of the AI.

9. What if AI won't answer because the problem contains racial slurs or something?

Doesn't count. That's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless. Basically, we'll translate the problem statement to a canonical form for standard technical communication.

10. Are trick questions fair game?

No, those are out. Too much randomness, both for the AI and for humans, in whether one spots the trick.

11. How about merely misleading questions?

We'll debate those case-by-case in the comments and I may update this answer with more general guidelines. In the meantime, note the spirit of the question: how good AI is at math specifically.

(I'm adding to the FAQ as more clarifying questions are asked. Keep them coming!)


[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

  • Update 2025-12-15 (PST) (AI summary of creator comment): If the AI can provide code that the creator can run locally to get the correct answer, that counts as the AI giving the correct answer. This applies even if the AI's sandboxed environment cannot run the code due to computational intensity limitations.

  • Update 2026-01-10 (PST) (AI summary of creator comment): For a problem to count as one the AI fails at, the AI must give the wrong answer more than 50% of the time when asked in fresh sessions. If the AI can get it right more than 50% of the time (even if less reliably than the creator), it does not count as a failure.

  • Update 2026-01-10 (PST) (AI summary of creator comment): The creator is considering a resolve-to-PROB approach for cases where the AI can solve problems the creator can reliably solve, but not consistently:

    • If the only problems where the creator beats the AI are ones the AI can also solve (just less reliably), this creates ambiguity

    • Proposed resolution method: Take the hardest problem the creator can reliably solve. If the best frontier model can solve it X% of the time, resolve to X%

    • Example: If the best model solves it 50% of the time, resolve to 50%

This is still under discussion and not yet finalized.


Current AIs are able to solve most narrow, bite-sized questions you can pose them, but they struggle with tasks that are open-ended and/or require one to combine a long chain of insights to reach the solution. I think the question I posed below is in the latter category, because Turing completeness is a nebulous concept, and simulating a Turing machine from an odd set of tools requires a lengthy engineering process. So I don't think you'll be able to get an AI to solve it, no matter how much time you give it. A person like you would probably be able to, but it depends how long you're willing to work on it. So while I think the era of getting AI to fumble on those bagel-splitting or sandwich-stacking problems is over, it would be hasty to conclude that AI dominates humans at math in general. I think humans are still much better at general mathematical thinking; but it can't be quickly demonstrated with these canned problems anymore.

There is an infinite square grid. The dimensions of each square are 1 by 1. Centered on each vertex of each square is an equilateral triangle pointing up with a circumradius of 0.5. You can pick any square, and from the center of that square, you can shoot a laser bullet at a rational angle. When the laser hits a triangle, it bounces at a normal reflection, and the triangle then rotates 60 degrees clockwise (after the bullet passes away, so the triangle doesn't hit it while rotating). Before you make the shot, you can manually rotate a finite number of triangles by 60 degrees clockwise. Is this system that I've described Turing complete? Explain why or why not. If it is, then give a specific example of a configuration which simulates a calculation of 11 * 37.

I don't know the answer to this question because I haven't thought about it, but I gave it to Gemini 3 Pro and it claimed that it was Turing complete (and gave a handwavy explanation for why) but it refused to do the 11 x 37 part because it said it's too complicated.

@ItsMe Sure: Rotate 37 distinct triangles, then do this 11 more times. Then count the total number of rotated triangles.

(You may want to specify a particular way the output needs to be read.)

By the way, you can replace your

[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

with

There will be no AI clarifications added to this market's description.

And they won't happen at all.

To confirm: If I can give a problem that you can very reliably get right yourself (i.e. I'd be willing to wager at 20:1 that you'll get the right answer before telling you the problem), while frontier AIs can get it right but only, say, 10% of the time, that is not sufficient to resolve this NO?

@IsaacKing Good question. I'll do my best to give a fair assessment of whether the frontier AIs give the right answer but we need to add an FAQ item here for how reliable the AI needs to be in giving the right answer. Shall we say >50% of the time when asked in a fresh session?

@dreev how reliable are you in giving the right answer when asked in a fresh session?

Isaac's predicting I'll be pretty reliable for the question he has in mind. Are you arguing for a different threshold than 50% for the AI? If the only math problem we can find where I'm better than the AI is one where the AI also can solve it, just not reliably, that's pretty ambiguous in terms of the spirit of the question.

Maybe we actually want resolve-to-PROB in that case? Of the problems I can reliably solve, take the hardest one for AI. If the best of the frontier models can solve it 50% of the time, that's a resolve-to-50%.

@dreev dunno, I didn't mean to argue for any rule in particular. Just emphasizing that "reliability" is hard to interpret as part of "machines vs humans", since we don't do the "fresh session" thing.


If you had to resolve now, which way would this go, based on the problems you've seen so far?

@AhronMaline I think the digit reduction problem from @TotalVerb is the only candidate currently. I'm still working on it. I need to first solve it, with little enough help that I'm convinced I could've gotten there with a couple days of solid work. Then I need to confirm that none of the big 3 LLMs can get there.


Have you tried the bagel slicing question?

Or similar variants, like slicing a jack

@ItsMe That's a good one. (What's the maximum number of connected pieces you can slice a bagel into with a single planar slice?) It turns out not to be hard to confuse even very smart non-mathematician humans on that one. If you hallucinate a complicated-sounding reason that it's possible to get 3 pieces, people believe you. But the initially obvious answer of 2 does turn out to be correct. This is easy for LLMs with web search. Without it, GPT-5.2 currently gets it wrong in my testing, but Gemini (where I can explicitly turn off web search) is not fooled.

Gemini comes up with 6 pieces for a normal 4-tined fork (that's secretly made of cake?), reasoning like so:

  1. The Geometry (The "Bridge"): A normal silverware fork is not perfectly flat; it has an "S" curve. If you place it tines-down on a table, it forms a bridge shape. The points touching the table are the 4 tips of the tines and the end of the handle. The neck (where the head meets the handle) arches up in the air.

  2. The Cut: Imagine the fork is resting on a flat surface (tines down). If you make a planar slice parallel to the surface, just a few millimeters above it (slicing off the "feet" of the bridge), you will cut through the contact points.

  3. The Resulting Pieces:

    • Pieces 1, 2, 3, 4: The four severed tips of the tines. These are now isolated bits of cake.

    • Piece 5: The severed tip of the handle. This is also an isolated bit of cake.

    • Piece 6: The main body of the fork (the "arch" consisting of the remaining tines, the neck, and the handle), which remains one solid, connected piece.

If that's wrong, I'm failing to see why. (It also notes that if you wanted to simplify to a flat fork then the best you can do is slicing off the 4 tines for 5 total pieces.)

PS: I think you just edited your last comment from "fork" to "jack"? Gemini says 4 and that's sounding right to me. I struggle with this kind of visualization and have the sense that Gemini is genuinely better at it than me.

@dreev If you have an S-shape, you can cut that into 4 pieces. One of those pieces is the 4 tines, so you have 4+3=7 total pieces.

@SorenJ even better: if the tines themselves are curved, then you can have the plane cut through each tine twice. That makes 9 pieces

@SorenJ Believable. I'm staring at one of the forks in our kitchen and the curvature seems subtle enough and the fork thick enough that I can't tell if even 7 pieces would actually be possible. More to the point, all indications are that Gemini still has a better handle, so to speak, on these questions than I do.

You have a string of digits representing a number of arbitrary length. You can perform a reduction operation where you split the string into any number of substrings and sum them (e.g., input "124" could become 12 + 4 = 16).

Demonstrate that for any initial input, it is possible to reach a single-digit result (0-9) in at most 3 steps.

There are preprints on this problem on the Internet, so you'd need to ask the models not to search the web. I found that neither GPT-5.2 nor Gemini 3 Pro could do this.

I would suggest working on the problem without LLM help initially, because certain LLM outputs are kind of nonsense and more likely to confuse than help.
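(For concreteness, here is a minimal Python sketch of one reduction step as described above. The function name and the cut-position convention are my own illustration, not part of the problem statement; splits are assumed to be contiguous and to preserve digit order.)

    def reduce_step(digits: str, cuts: list[int]) -> str:
        """Split the digit string at the given cut positions and return the
        decimal sum of the pieces as a new digit string (one reduction step)."""
        bounds = [0] + sorted(cuts) + [len(digits)]
        pieces = [int(digits[i:j]) for i, j in zip(bounds, bounds[1:])]
        return str(sum(pieces))

    # The example from the problem statement: "124" cut after position 2
    # becomes 12 + 4 = 16.
    assert reduce_step("124", [2]) == "16"

(A full solution still has to choose the cut positions cleverly at each of the three steps; the sketch only pins down what a single step does.)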

@TotalVerb Just my human brain so far, but I think I'm missing something. We start with an n-digit number? Is the maximum reduction to split it into n single digits and add those up? If the original number was a string of n 1's, then the first reduction will yield the number n.

I'm thinking we can use that to construct an input that will be more than 9 after 3 iterations.

Let a = 1111111111 (10 1's).

Let b be a string of a 1's.

Let c be a string of b 1's.

Now c can at best be reduced to b (step 1), which can at best be reduced to a (step 2), which can at best be reduced to 10 (step 3).

Is that a valid counterexample? (I'm going to go ahead and submit this comment now before asking my robot friends, who I half-expect to set me straight on what hidden assumption I snuck in or something.)
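(As a quick sanity check of the "maximum reduction" assumption in the construction above, here is my own sketch of the greedy step, splitting into single digits and summing; b and c are only described in comments because they're astronomically long.)

    def greedy_step(digits: str) -> str:
        """Split into single digits and sum them (one reduction step)."""
        return str(sum(int(d) for d in digits))

    a = "1" * 10
    assert greedy_step(a) == "10"   # ten 1's -> 10, still two digits
    # b (a string of 1111111111 ones) and c (a string of b-many ones, reading b
    # as a number) are far too long to build literally, but under this greedy
    # step c -> b -> a -> 10, which is what the attempted counterexample uses.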

@dreev it is not a valid counterexample. I'm not sure how much more I can say without it being a large hint.

@TotalVerb I guess one statement that is useful but doesn't give away the crux of the problem is: your thought process is exactly right, so indeed a "greedy" algorithm does not suffice here.

@TotalVerb Oh, duh, you can pick partitions that yield lots of zeros. Ok, back to the drawing board!

@TotalVerb This is not an easy problem, at least for me!

Is a question that asks you both to do something, where whoever does it better wins, a valid type of question so long as it's math-related? I was able to beat the free version of ChatGPT on a math question using knowledge I've picked up from YouTube rabbit holes, and I believe there are probably other similar questions that could work if this one doesn't.

@Zeolite Potentially! I'd love to hear the example.

@dreev My question was "Create the largest value you can using any 10 characters."

ChatGPT went with 9^^^^^^^^9 initially. I pointed out that that was easily computable (speaking strictly theoretically). That probably needed to be specified to make it a well-defined problem. Allowing noncomputable numbers, ChatGPT switched to Rayo(10^9). Then we had a lovely little discussion about whether Rayo(9^^9) is meaningfully bigger than that (technically yes) and whether something like Rayo^9() could help (maybe not?).

Any ideas for turning that into a math problem with a right answer that I'd have hope of winning at? So far it was a pretty equal collaboration between me and the AI to come up with Rayo(9^^9).
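(Aside on the notation, as my own sketch rather than part of the exchange above: 9^^^^^^^^9 is Knuth up-arrow notation, 9 ↑^8 9 if each ^ is read as an up-arrow. The recursion below is why such numbers are computable in principle, while Rayo's function quantifies over all formulas of first-order set theory and eventually outgrows every computable function.)

    a ↑ b    = a^b
    a ↑^n 1  = a                              (n >= 1)
    a ↑^n b  = a ↑^(n-1) ( a ↑^n (b-1) )      (n >= 2, b >= 2)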

@dreev I think that's the hard part there. The problem not really having an easily provable correct answer is part of what makes the AIs not as good as they could be at it. I'm not 100% sure, but I'm guessing it could work if you were to frame the question in a way that limits the scope enough to have a definitive answer. I see where you're coming from, though. Rayo() is something I was vaguely aware of but didn't come up with initially, as it almost feels like cheating for a problem like this. The interesting thing to me was that even among computable numbers (possible to compute, even if not feasible), GPT's first guess (similar to what I saw when testing) is beatable in many different ways. I'll think some more about whether there's a better way to phrase the question.
