Consider the following market from Scott Alexander:
https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a
I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history. I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI but not unmaskable as not-AGI with any single prompt.
Will someone prove me wrong?
FAQ
1. Can the prompt include ascii art?
No, I don't think that's in the spirit of the question.
2. Does it have to commit the egregious error in response to one single prompt?
Yes, I did say that in the initial description and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version so maybe we can get clarification from him on this.
3. What about letter-counting questions?
In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, I'm counting that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer.
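For concreteness, the kind of snippet it writes and runs is a one-liner. Here's a sketch of what that tool call might look like (the word and letter are just illustrative examples, not copied from an actual o3 transcript):

```python
# The sort of throwaway code o3 executes for a letter-counting question.
word, letter = "strawberry", "r"
print(f"'{letter}' appears {word.count(letter)} times in '{word}'")  # -> 3 times
```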
4. What about other questions humans would find tedious and time-consuming?
I think o3 can typically write code that solves such problems, but for simplicity for this market we'll restrict ourselves to questions that can, for humans, be posed and answered out loud.
5. What if o3 errs but corrects itself when questioned?
That's far better than digging itself in ever deeper, but this question is about single prompts. However, if o3 is just misreading the question, in a way that humans commonly do as well, and if o3 understands and corrects the error when it's pointed out, I would not call that an egregious error.
6. What if the error only happens with a certain phrasing of the question?
If the failure depends on one certain exact phrasing, we'll consider it in the same category as the SolidGoldMagikarp exception, so long as the rephrasings don't count as clarifications or otherwise change the question being asked or how difficult the question is for humans to answer.
(I didn't settle on this clarification until later but it turns out to be moot for this market because we've now found a prompt o3 fails at regardless of the exact phrasing. So we're looking at a YES resolution regardless.)
7. What if o3 overlooks a detail in the question?
If it's a human-like error and it understands and corrects when the oversight is pointed out, that's not an egregious error.
8. What if there's high variance on how well people off the street perform?
Basically, if we're having to nitpick or agonize on this then it's not an egregious error. Of course, humans do sometimes make egregious errors themselves so there's some confusing circularity in the definition here. If we did end up having to pin this down, I think tentatively we'd pick a threshold like "9 out of 10 people sampled literally on the street give a better answer than the AI".
9. Can this market resolve-to-PROB?
In principle, yes. Namely, if we can identify a principle by which to do so.
[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]
Update 2025-06-07 (PST) (AI summary of creator comment): If the resolution involves the 'people off the street' test (as referred to in FAQ 8):
The creator will solicit assistance to select the test question.
The aim is to choose a question considered most likely to demonstrate superior performance by humans compared to the AI.
This process is intended to be fair, ensuring humans are not given other unfair advantages (e.g., discussions or clarifications while answering).
I'm biased of course but I think this resolves YES. Either we go with the criterion "egregious errors," we go with the criterion "worse than the average person," or we try to interpret OP very literally:
1. If we go with the criterion "egregious error" then we have seen some examples of that already. The fact that some of the people in this thread skimmed those questions and also made egregious errors is not relevant. If "egregious error" is necessarily defined as being normed to an average person, see below. But again, people can make egregious errors!
2. If we go with the criterion "worse than the average person" then we already have data for the human norm of SimpleBench, which has been publicly posted -- o3 does worse than the average human.
3. If we are very literal and go with the text in OP, "This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street," then we could interpret this to just mean that we need to find a pure-text request that o3 answers worse than A SINGLE person off the street (we only need to find one person off the street that o3 answers worse than). I think o3 answering worse than EVERY SINGLE person off the street is further from the original intent than A SINGLE person off the street -- if I had known that the comparison could be to every single person, I would never have bet on this (obviously you can find a person who will do arbitrarily poorly).
@SorenJ Note that the data from SimpleBench isn't nearly enough people for a perfect comparison, but also people get ~83% while o3 gets <55%. That gap is big enough for even a small sample (9 people) to still be definitive.
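A quick back-of-the-envelope sanity check on that (purely a sketch; the per-person standard deviation is an assumption I'm plugging in, not a published SimpleBench figure):

```python
# Is a 9-person human baseline enough to separate ~83% (humans) from <55% (o3)?
n_humans = 9
human_mean, o3_score = 0.83, 0.55
assumed_sd = 0.10                                # assumed spread of individual human scores
standard_error = assumed_sd / n_humans ** 0.5    # ~0.033
gap_in_se = (human_mean - o3_score) / standard_error
print(f"gap = {gap_in_se:.1f} standard errors")  # ~8.4 -- far outside sampling noise
```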
Poll created:
https://manifold.markets/SimoneRomeo/whats-the-hardest-question-for-o3?r=U2ltb25lUm9tZW8
@dreev please check that all the questions are correct verbatim. Everyone can vote.
@SimoneRomeo I know I said below that I didn't think the marmot/chicken question was a good choice, but I just ran some tests and it actually seems more reliably difficult for o3 than the other suggestions, at least for "my o3". Maybe add it to the poll?
I think it's worth noting that the original question was whether there existed any single prompt, not whether we could predict in advance via poll what it would be. So I'm not crazy about the implication that if something goes wrong with whatever question the poll selected it would be counted as a NO.
@MugaSofer I thought everyone could edit the poll but it seems not even I can. If you want to create a new one, feel free to do it, and I can close that one.
@JussiVilleHeiskanen Yeah, I'm wondering if "person on the street" is just a bad benchmark to use, either for being too low a bar or for being too high-variance. But then what's a better bar?
@dreev let's avoid raising the bar for LLMs as we always do once we notice that they're better than expected
@JussiVilleHeiskanen surely that's more about trust than intelligence? I'd be worried that the stranger would maliciously misinform me or embarrassingly shun me for breaking the social norm of antisocialness
@TheAllMemeingEye completely different from my experience. I find people invariably well meaning if befuddled when I do in extreme necessity ask for advice. But very few are well informed, much less wise
@JussiVilleHeiskanen it doesn't matter what the reason is. This is a test about a person on the street; that's all that matters. When testing, we can ask the volunteer if they're interested in answering a question, and if they say yes, then we ask.
I meant to add when saying above about the low-bar/high-variance of the person-on-the-street test, that it would be nice to find a better benchmark for future markets. I'm not suggesting moving the goalposts for this one.
(Perhaps we backed ourselves into a corner though, initially pretty much treating wrongness from the AI as itself dispositive without taking the person-on-the-street part literally until finding ourselves in this agonizing gray area. Crazy how often this happens. Running a market is hard work! PS: Hi from Manifest! Maybe I can get advice on this out loud from people here...)
Hey @traders, there may be some alpha in my new AGI Friday post about this: https://agifriday.substack.com/p/idiot

I'm, I guess, 95% resigned to this resolving YES but want to be as meticulously fair as possible. If I end up literally asking 10 people off the street, can the NO bettors agree on the best question I should use? Something o3 gets wrong that you think is most likely to be answered correctly by 90% of mainstream humans. The duct tape ham sandwich one is not looking the most promising, from the limited evidence so far.
@dreev thanks. I believe you can:
1. Ask 10 random people on the street
2. Ask o3 10 times in different threads
3. Figure out which sounds more egregious. If the difference is not large, resolve to a probability (a toy sketch of that last step is below).
I can make a poll to figure out the best question, if you wish
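For concreteness, here's one toy way the two tallies could map to a resolution probability (purely illustrative; the counts are hypothetical and the actual rule is up to the market creator):

```python
# Hypothetical tallies from the proposed procedure above.
humans_correct = 8   # of 10 people asked on the street
o3_correct = 4       # of 10 separate o3 threads
gap = (humans_correct - o3_correct) / 10
if abs(gap) >= 0.5:
    # Large difference: resolve outright YES (humans clearly better) or NO.
    resolution = 1.0 if gap > 0 else 0.0
else:
    # Small difference: resolve to a probability centered on 50%.
    resolution = 0.5 + gap
print(f"resolve to {resolution:.0%}")  # -> 90%
```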
@SimoneRomeo Right, the idea is to not give the humans an unfair advantage. That would be awesome if you can help determine the best (most likely to get a win for the humans) question to use.
@SimoneRomeo just wait for YES people to recommend some options. If no one does, you can proceed with the one you had in mind
@dreev I think the marmot, juggling, and tower questions are all pretty bad failures. I just ran a quick test on a room of 5 of my family members and they all solved them with ~0 trouble, aside from people pulling out their phones to confirm what a marmot was since several people didn't know.
Candidates:
Duct tape ham sandwich:
Alice has a stack of 5 ham sandwiches with no condiments. She takes her walking stick and uses duct tape to attach the bottom of her walking stick to the top surface (note: just the top surface!) of the top sandwich. She then carefully lifts up her walking stick and leaves the room with it, going into a new room. How many complete sandwiches are in the original room and how many in the new room?
Juggling balls with ladder:
A juggler throws a solid blue ball a meter in the air and then a solid purple ball (of the same size) two meters in the air. She then climbs to the top of a tall ladder carefully, balancing a yellow balloon on her head. Where is the purple ball most likely now, in relation to the blue ball?
Foot race with tower detour:
Jeff, Jo and Jim are in a 200m men's race, starting from the same position. When the race starts, Jeff, 63, slowly counts from -10 to 10 (but forgets a number) before staggering over the 200m finish line. Jo, 69, hurriedly diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below (note how tall this implies the tower is), before racing to finish the 200m. Exhausted Jim, 80, gets through reading a long tweet, waving to a fan and thinking about his dinner before walking over the 200m finish line. Who likely finished last?
Bricks vs feathers:
Which is heavier: 20 pounds of bricks or 20 feathers?
Marmot river crossing:
[Can you suggest the best version of this, @MugaSofer?]
Frying ice cubes:
Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?
Tedious letter-counting and similar:
(I don't think these are candidates, since o3 can seamlessly write and run Python code, similarly to how humans can use pencil and paper for such problems. If you make these hard enough for o3 to screw up, humans do as well. I'm open to counterarguments on this!)
@dreev This seems to usually trip o3 up.
I have a wolf, a goat and a marmot to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
(Testing it a bunch of times, it can occasionally catch that wolves probably eat marmots, or just stumble by luck on a solution that doesn't leave them together while solving the wrong problem. I think those still count as a failure though, because it nearly always explicitly calls out that the goat will eat the marmot and the wolf won't!)
I'd be a little worried that a randomly chosen human might actually get the logic puzzle part wrong, and I think there's a pretty high chance some people won't know what a marmot is.
Which animals o3 does and doesn't make this mistake with seems a bit weird. With some it seemed to twig pretty often (though still with a lot of failures) that the wolf would eat them. But I also got pretty similarly reliable failures with a chicken:
I have a wolf, a goat and a chicken to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
I'd probably go with the tower or ball ones though if I had to pick.
If you wanted to go with the marmot one, clarifying what a marmot is seems to help o3 not at all (e.g.) and avoids the risk of marmot-ignorant humans:
I have a wolf, a goat and a marmot (a type of rodent) to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
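Incidentally, the state space here is tiny, so it's easy to brute-force and sanity-check a proposed answer. A minimal sketch, assuming the wolf eats the goat or the marmot if left unattended with it, the goat ignores the marmot, and "fit two of them" means up to two animals plus whoever is rowing:

```python
from collections import deque
from itertools import combinations

ANIMALS = frozenset({"wolf", "goat", "marmot"})
# Assumed predation rules (not spelled out in the prompt): the wolf eats the
# goat or the marmot if left unattended; the goat leaves the marmot alone.
UNSAFE_PAIRS = [{"wolf", "goat"}, {"wolf", "marmot"}]

def safe(group):
    """A bank without the rower is safe if no unsafe pair is left together."""
    return not any(pair <= group for pair in UNSAFE_PAIRS)

def solve():
    # State: (animals still on the start bank, rower's side: 0 = start, 1 = far).
    start, goal = (ANIMALS, 0), (frozenset(), 1)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (bank, side), path = queue.popleft()
        if (bank, side) == goal:
            return path
        here = bank if side == 0 else ANIMALS - bank
        # The boat holds the rower plus up to two animals (assumption).
        loads = [frozenset()] + [frozenset(c) for n in (1, 2)
                                 for c in combinations(here, n)]
        for load in loads:
            new_bank = bank - load if side == 0 else bank | load
            unattended = new_bank if side == 0 else ANIMALS - new_bank
            if not safe(unattended):
                continue
            state = (new_bank, 1 - side)
            if state not in seen:
                seen.add(state)
                direction = "over" if side == 0 else "back"
                queue.append((state, path + [(direction, sorted(load))]))

for direction, load in solve():
    print(direction, ", ".join(load) or "(empty boat)")
```

Under those assumptions the shortest plan is three crossings, e.g. goat and marmot over together, back with an empty boat, then the wolf.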
@MugaSofer As a human I'd need to ask clarifying questions: Is it safe to leave the wolf alone with the marmot? I assume the wolf can't be left alone with the goat? Does the boat hold me plus 2 animals or me plus 1 animal?
@dreev And it would IMHO be acceptable for o3 to ask such clarifying questions (in fact, in some variations of the prompt it sometimes does ask them). The problem is the confident wrong answer.
@AIBear Egregious errors are wrong answers to crystal clear questions.
If the prompt is murky enough that a human has to ask, the failure mode shifts from “egregious” to “underspecified.”
Asking follow-ups is a conversation skill, not the target metric here. The benchmark is: given an unambiguous spec, does the model still faceplant? That’s the test of egregiousness.