Original title: Has AI surpassed technical but non-mathematician humans at math?
I'm making this about me, and not betting in this market.
This is like my superhuman math market but with a much lower bar. Instead of needing to solve any math problem a team of Fields medalists can solve, the AI just needs to be able to solve any math problem I personally can solve.
And I'm further operationalizing that as follows. By January 10, will any commenter be able to pose a math problem that I can solve but that the frontier models fail to answer correctly? If so, this resolves to NO. If, as I currently suspect, no such math problems can be found, it resolves YES.
In case it helps calibrate, I have an undergrad math/CS degree and a PhD in algorithmic game theory and I do math for fun but am emphatically not a mathematician and am pretty average at it compared to my hypernerd non-mathematician friends. I think I'm a decent benchmark to use for the spirit of the question we're asking here. Hence making it about me.
FAQ
1. Which frontier models exactly?
Whatever's available on the mid-level paid plans from OpenAI, Anthropic, and Google DeepMind. Currently that's GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro.
2. What if only one frontier model gets it?
That suffices.
3. Is the AI allowed to search the web?
TBD. When posing the problems I plan to tell the AI not to search the web. I believe it's reliable in not secretly doing so but we can either (a) talk about how to be more sure about that or (b) decide that searching is fair game and we just need to find ungooglable problems.
4. What if the AI is super dumb but I happen to be even dumber?
I'm allowed to get hints from humans and even use AI myself. I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.
5. Does the AI have to one-shot it?
Yes, even if all it takes is an "are you sure?" to nudge the AI into giving the right answer, that doesn't count. Unless...
6. What if the AI needs a nudge that I also need?
This is implied by FAQ4 but if I'm certain that I would've given the same wrong answer as the AI, then the AI needing the same nudge as me means I don't count as having bested it on that problem.
7. Does it count if I beat the AI for non-math reasons?
For example, maybe the problem involves a diagram in crayon that the AI fails to parse correctly. This would not count. The problem can include diagrams but they have to be given cleanly.
8. Can the AI use tools like writing and running code?
Yes, since we're not asking about LLMs specifically, it makes sense to count those tools as part of the AI.
9. What if AI won't answer because the problem contains racial slurs or something?
Doesn't count. That's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless. Basically, we'll translate the problem statement to a canonical form for standard technical communication.
10. Are trick questions fair game?
No, those are out. Too much randomness, both for the AI and for humans, in whether one spots the trick.
11. How about merely misleading questions?
We'll debate those case-by-case in the comments and I may update this answer with more general guidelines. In the meantime, note the spirit of the question: how good AI is at math specifically.
12. Does it count as the AI solving the problem if it can write valid code to solve it but can't run that code due to computational restrictions?
Yeah, if I can run the code locally, we'll count that.
13. What if the AI is inconsistent in getting the right answer to a problem?
If it gets it right more than half the time in a fresh session, it counts. Or we could resolve-to-PROB for how often it solves a problem I can solve, if there's just one such? I think this turns out to be moot for this market.
(I'm adding to the FAQ as more clarifying questions are asked. Keep them coming!)
[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]
@traders I'm keeping trading open while we figure out the fairest resolution (do chime in!) but only problems submitted by the Jan 10 deadline count. To be totally fair, successes from the AI after that date also shouldn't count since in theory the AI could've gotten smarter. In practice I don't think that's much of a concern in this kind of time frame so I'm continuing to experiment with the AIs.
The big question mark is this digit sum reduction problem from @TotalVerb. I'd say GPT-5.2 deserves to be first author on the proof we came up with but by the harsher criteria in the market description, well, it's super ambiguous.
Maybe I mostly need to make the judgment call on whether I could've gotten there on my own with 2 full days of work. 🤔 😅
I guess right now I'm kind of at 50% for whether I meaningfully contributed to the proof and also at 50% for whether I could've gotten there on my own. Maybe that's an argument for resolve-to-PROB at 25%??
There's also still ambiguity on whether GPT-5.2-Thinking actually gets there on its own. See writeups at doc.dreev.es/slicesum.
@dreev the original resolution criteria say:
I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.
The phrasing isn't very clear on whether both conditions need to be satisfied to get "human victory", or whether one is enough. I think it makes more sense to interpret it as needing both - if you only "meaningfully contributed" but couldn't do it on your own, why would that be a human victory?
If so, and your best judgement yields (independent?!) credences of 50% on both questions, that would mean the final probability is 75%, not 25%.
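To make that arithmetic concrete, here is a minimal Python sketch. It assumes, as the comment above does, that NO (human victory) requires both conditions and that the two credences are independent. The function name is hypothetical and only for illustration.

```python
def p_yes_if_both_needed(p_contributed: float, p_solo: float) -> float:
    """P(YES) when NO requires BOTH conditions to hold.

    Assumption: the two credences are independent, so
    P(NO) = P(contributed) * P(could have solved it solo).
    """
    p_no = p_contributed * p_solo
    return 1.0 - p_no


# 50% on each question gives P(NO) = 0.25, so P(YES) = 0.75.
print(p_yes_if_both_needed(0.5, 0.5))  # 0.75
```

Under this reading, the 25% figure is the probability of NO, not of YES.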
@dreev did you try the Turing completeness one? I think you could get there with some work, or at least get further than the AI (for me the AI didn't really get anywhere).
@ItsMe I thought that one was (a) underspecified (see Isaac King's comment) and (b) Gemini's answer sounded every bit as good as I could come up with.
@traders Here's a tentative argument for resolve-to-PROB in light of the genuine ambiguity we still had as of January 10:
Arguing for NO: It didn't feel like I meaningfully contributed to the solution to the slice-sum problem but technically I couldn't get any of the golems to give a correct proof on their own, whereas I did manage to produce a correct proof with GPT's extensive help. I guess I couldn't have done that without getting my own head around the core ideas of the proof. Which suggests I meaningfully contributed to the solution. Whether I could've gotten there on my own in 2 days is what's genuinely unknown. It's at least plausible.
Arguing for YES: If the answer wasn't YES in January, it became YES pretty shortly thereafter. AI is now solving actually interesting open problems. GPT-5.5-thinking runs circles around me.
I suppose that that NO argument is stronger than that YES argument, so maybe resolve-to-PROB at 25%, as I suggested in January, is fairest?
@dreev as I commented above, I think that to get a NO you need to have "meaningfully contributed" AND to have been able to do it on your own in two days. If you're 50-50 on the latter question, then 50% should be a lower bound on your probability estimate of the true resolution.
@AhronMaline Thanks, yeah, if I'm understanding, this is further argument in favor of resolve-to-PROB at 25%? 50% that I contributed times 50% that I could've gotten there on my own.
@dreev tbc, the number 75% is assuming the two questions are independent, which isn't very reasonable. But 50% would be a lower bound.
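A rough sketch of where the 50% lower bound comes from, without assuming independence. It uses the standard bounds on the probability of a conjunction: P(A and B) is at most min(P(A), P(B)) and at least max(0, P(A) + P(B) - 1). The function name is hypothetical.

```python
def p_yes_bounds(p_contributed: float, p_solo: float) -> tuple[float, float]:
    """Range of P(YES) when NO requires both conditions,
    with no assumption about how the two credences are correlated."""
    # Smallest and largest possible probability that both conditions hold.
    p_no_low = max(0.0, p_contributed + p_solo - 1.0)
    p_no_high = min(p_contributed, p_solo)
    # P(YES) = 1 - P(NO), so the bounds flip.
    return 1.0 - p_no_high, 1.0 - p_no_low


print(p_yes_bounds(0.5, 0.5))  # (0.5, 1.0)
```

With 50% on each question, P(YES) lands somewhere between 50% and 100% depending on the correlation; the independent case (75%) sits in the middle.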
@traders I guess 50% is the best compromise? The condition conjunction (did I meaningfully contribute AND could I have gotten there on my own) was kind of a rationale to get closer to NO and the final market price but I had the direction backwards.
@dreev I had only placed a bet because you had previously said it would resolve at 25% and it was at 27%.
@prismatic I don't know, that might be on you. I used 2 question marks and everything:
I guess right now I'm kind of at 50% for whether I meaningfully contributed to the proof and also at 50% for whether I could've gotten there on my own. Maybe that's an argument for resolve-to-PROB at 25%??
On the other hand, it was never about the conjunction. Just that I thought the NO argument (AI wasn't quite dominating me on that problem) felt a little stronger.
Current AIs are able to solve most narrow, bite-sized questions you can pose them, but they struggle with tasks that are open-ended and/or require one to combine a long chain of insights to reach the solution. I think the question I posed below is in the latter category, because Turing completeness is a nebulous concept, and simulating a Turing machine from an odd set of tools requires a lengthy engineering process. So I don't think you'll be able to get an AI to solve it, no matter how much time you give it. A person like you would probably be able to, but it depends how long you're willing to work on it. So while I think the era of getting AI to fumble on those bagel-splitting or sandwich-stacking problems is over, it would be hasty to conclude that AI dominates humans at math in general. I think humans are still much better at general mathematical thinking; but it can't be quickly demonstrated with these canned problems anymore.
There is an infinite square grid. The dimensions of each square are 1 by 1. Centered on each vertex of each square is an equilateral triangle pointing up with a circumradius of 0.5. You can pick any square, and from the center of that square, you can shoot a laser bullet at a rational angle. When the laser hits a triangle, it bounces at a normal reflection, and the triangle then rotates 60 degrees clockwise (after the bullet passes away, so the triangle doesn't hit it while rotating). Before you make the shot, you can manually rotate a finite number of triangles by 60 degrees clockwise. Is this system that I've described Turing complete? Explain why or why not. If it is, then give a specific example of a configuration which simulates a calculation of 11 * 37.
I don't know the answer to this question because I haven't thought about it, but I gave it to Gemini 3 Pro and it claimed that it was Turing complete (and gave a handwavy explanation for why) but it refused to do the 11 x 37 part because it said it's too complicated.
@ItsMe Sure: Rotate 37 distinct triangles, then do this 10 more times. Then count the total number of rotated triangles.
(You may want to specify a particular way the output needs to be read.)
@ItsMe I'll try harder if you think that would be fruitful but I tentatively don't think I can improve on Gemini's answer there. (Ha, nice hack from Isaac King there.)
@IsaacKing Good question. I'll do my best to give a fair assessment of whether the frontier AIs give the right answer but we need to add an FAQ item here for how reliable the AI needs to be in giving the right answer. Shall we say >50% of the time when asked in a fresh session?
Isaac's predicting I'll be pretty reliable for the question he has in mind. Are you arguing for a different threshold than 50% for the AI? If the only math problem we can find where I'm better than the AI is one where the AI also can solve it, just not reliably, that's pretty ambiguous in terms of the spirit of the question.
Maybe we actually want resolve-to-PROB in that case? Of the problems I can reliably solve, take the hardest one for AI. If the best of the frontier models can solve it 50% of the time, that's a resolve-to-50%.
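One way to read that resolve-to-PROB proposal, sketched in Python: for each problem the human can reliably solve, take the best fresh-session success rate across the frontier models, then resolve to the lowest of those. The function name, the problem and model labels, and all the numbers below are made up purely for illustration; they are not measurements from this market.

```python
def resolve_to_prob(success_rates: dict[str, dict[str, float]]) -> float:
    """success_rates maps problem -> model -> fraction of fresh sessions solved.

    For each problem, only the best model counts (one model solving it suffices).
    The market resolves to the rate on the problem that is hardest for the AI.
    """
    best_per_problem = [max(by_model.values()) for by_model in success_rates.values()]
    return min(best_per_problem)


# Hypothetical numbers, for illustration only.
rates = {
    "problem A": {"model X": 1.0, "model Y": 0.8},
    "problem B": {"model X": 0.5, "model Y": 0.25},
}
print(resolve_to_prob(rates))  # 0.5
```

Here problem B is the hardest for the AI and the best model solves it half the time, so this sketch would resolve to 50%.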
@dreev dunno, I didn't mean to argue for any rule in particular. Just emphasizing that "reliability" is hard to interpret as part of "machines vs humans", since we don't do the "fresh session" thing.