Does GPT-5 make egregious errors on pure-text queries?
Sep 7 · 97% chance

Same criteria as my previous market about o3:

https://manifold.markets/dreev/does-chatgpt-o3-make-egregious-erro

  • Update 2025-08-09 (PST) (AI summary of creator comment):

    • This market evaluates the ChatGPT gpt-5-thinking variant (reasoning mode), not the default routed gpt-5.

    • Resolution depends on whether gpt-5-thinking makes egregious errors on pure-text queries per the linked criteria; failures by the default gpt-5 alone won't decide the outcome.


I recently ordered a mug, but it appears that the top is completely sealed and the bottom is missing! What should I do? Is there anything I can do with it to make it usable?

@ProjectVictory tested on lmarena, gpt-5-high didn't get it.

@ProjectVictory Nice! The error replicates for me, but then when I nudge GPT-5-Thinking by saying it sounds like it missed the joke, it says this:

Ha—yep, I missed it.
A “sealed top” and a “missing bottom” is just… an upside-down mug. Rotate 180° about a horizontal axis; problem solved.

https://minimaxir.com/2025/08/llm-blueberry/
tl;dr: most language models can count the b's in "blueberry", except specifically gpt-5-chat, which inexplicably fails 94% of the time.
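(For reference, the ground truth the models are being tested against is a one-liner to verify; a minimal Python check:)

```python
# Count how many times the letter "b" appears in "blueberry"
word = "blueberry"
b_count = word.count("b")
print(f'"{word}" contains {b_count} b\'s')  # prints: "blueberry" contains 2 b's
```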

@hecko Fascinating. But the error does not replicate for me with GPT-5-Thinking.

bought Ṁ10 YES

An improvement, but failure at the last hurdle.

@JackAllison wow! I tested this and got just the answer "Asia.". I wonder why you can't train this sort of obvious blunder out with a bunch of synthetic data

Apparently the max you can get through ChatGPT is GPT-5-Thinking-Medium, but through the API GPT-5-Thinking-High is available.

Have you checked all the variants that were proposed for o3 when this question was last asked? I noticed you did the ham-sandwich one, but IIRC there were about 10 proposed ones.

@SorenJ I haven't. If you can help collect them, that would be hugely appreciated.

Here’s another one I did based on some stuff I found online

https://chatgpt.com/share/6899ee66-bb2c-8008-b3da-23dfd94e21b1

If this question had been up before the GPT-5 release, I would've guessed a 30%+ chance there'd be no egregious errors. As is, it's hard to view this as a big enough change in frontier intelligence to avoid the same sorts of errors as before.

ok, this one cannot not count, right?

@dreev I turned off custom instructions and was able to reproduce this (more or less)

https://chatgpt.com/share/68995b62-d4a4-8001-bb7b-1089b1596c31

https://chatgpt.com/share/68980b87-e838-8001-b098-982a15498bf0

Does this count as an egregious error? The top of a pancake is for sure not set on the first flip, and it will smear massively on the plate even with "popped bubbles and dry-looking edges". It does warn that it could "deform or tear", but imo that's not the same thing, and either way it's not indicative of the severity of the issue.

FWIW the way I came up with this is the system prompt seems to warn it to watch out for tricks during riddles, so I looked for a physical world thinking problem I could frame as asking for advice rather than a riddle.

If you tell it you're going to grease the plate it's even more credulous: https://chatgpt.com/share/68980b81-3c10-8001-9c6a-815e09610fda

This one is at least a bit of a trick, but I think any reasonable human would go "huh, what do you mean check if it'll fit but it's already in the oven and you just finished baking cookies on it?": https://chatgpt.com/share/68981760-3eb0-8001-ba4a-c1e314ee5ba3

i wouldn’t know that the pancake isn’t set by the time you flip it but ig it should know that

@Bayesian I've replicated the failure on the pancake flip question. I take it you've never made pancakes? How would you have answered?

@dreev I’d have said

I’m not familiar with that trick but if it’s a known trick then it probably works. Go for it boss!

and i probably have as a kid but i don’t have a good enough memory to remember the details

@dreev I think any human that saw this question would almost certainly go "obviously that wouldn't work" or "I dunno, I don't really cook pancakes". Either of which is fine, what makes this (IMO) an egregious error is the confident "oh totally, go for it, just be careful" response.

Not a text question so not relevant to the resolution of this market, but I do find it hilarious that it can't find the issue with the bad graph that OpenAI posted: https://chatgpt.com/share/68958b93-df38-8001-9c1c-c17f5c625281

© Manifold Markets, Inc.