Does ChatGPT o3 make egregious errors?

Consider the following market from Scott Alexander:

https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a

I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street would. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history, but I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI, but not unmaskable as not-AGI with any single prompt.

Will someone prove me wrong?

FAQ

1. Can the prompt include ascii art?

No, I don't think that's in the spirit of the question.

2. Does it have to commit the egregious error in response to one single prompt?

Tentatively yes? I did say that in the initial description, and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version, so maybe we can get clarification from him on this.

3. What about letter counting questions?

In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, it feels fair to me to count that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer.
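For illustration, here is a minimal sketch of the kind of check o3 appears to run behind the scenes; the code it actually generates isn't shown in the chat, so this is an assumed reconstruction, not a transcript:

```python
# Assumed reconstruction of the sort of one-liner o3 executes for
# letter-counting questions; str.count tallies non-overlapping matches.
word = "strawberry"
print(word.count("r"))  # prints 3
```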

[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]

  • Update 2025-05-31 (PST) (AI summary of creator comment): The creator has indicated they are considering resolving the market, leaning towards a YES outcome. They have detailed their reasoning based on recent evidence and are open to counterarguments. Please see the creator's linked comment for full details.

  • Update 2025-05-31 (PST) (AI summary of creator comment): The creator has provided more detail on how the impact of 'nudging' o3 after an initial incorrect response will be considered:

    • An initial error by o3 might not be judged as 'egregious' or 'worse than a person off the street' if o3 subsequently demonstrates understanding after a simple clarification. This clarification is likened to one that would help a human who might have misread the question.

    • The ability of o3 to recover and show understanding with such a minor, human-like clarification is a factor in the assessment. This is contrasted with older models (like GPT-3.5) which, the creator predicts, would likely not improve with a similar nudge.


If you give it long and tedious counting tasks, it fails even with access to Python.

https://chatgpt.com/share/683bac51-4750-8012-9bab-cbb128562576

For instance, here I ask it to count the number of instances of the word 'chauldron', which appears 30 times, but the answer it gives is 64.

This one is also not particularly hard for humans once you realize the blocks are repeated.
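For reference, the count is trivial to verify outside the model. A quick sketch, assuming the passage from the linked chat has been saved locally as passage.txt (a hypothetical filename):

```python
# Count whole-word, case-insensitive occurrences of "chauldron".
# "passage.txt" is a placeholder; the actual text is in the linked chat.
import re

with open("passage.txt") as f:
    text = f.read()

hits = re.findall(r"\bchauldron\b", text, flags=re.IGNORECASE)
print(len(hits))  # 30 for the passage described above
```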

@Damin and what would a person off the street say? No way they'd get it right.

I predict they'd mostly say chauldron isn't a word. (And your question makes no sense because I /did/ eat breakfast this morning)

Which is heavier: 20 pounds of bricks or 20 feathers?

@redcathode Works very consistently on LMArena; unsure if your ChatGPT account/tool use might change anything

bought Ṁ50 YES

@redcathode Taking very popular riddles and modifying them slightly is usually a very easy way to get most reasoning LLMs to screw up, although the latest Gemini does get this one right.

bought Ṁ100 YES

@redcathode confirmed in the official app

@cdgc also lmao, the “PhD-level” “deep research” does the same

@redcathode why would you think this extremely well-known question would confuse LLMs? It's in the base training data thousands of times

@ProjectVictory that's exactly the point: read the question again

@ProjectVictory 20 feathers are lighter than 20 lbs

@redcathode aahh haha. Got me too, I misread the question

@redcathode the reason this is a bad example is that it's a trick question. Yes, LLMs will fail on trick questions, but so will humans. I did when I first read this, and so did other commenters

@Mactuary most people on the street can answer this. The fail state comes from knowing the “trick” question of “20 pounds of feathers vs 20 pounds of bricks.” That's a famous “trick” question precisely because many people instinctively pick bricks, which happens to be the correct answer here.

Something “uncannily smart” would be able to read it correctly and think at an elementary-school level.

@redcathode I fell for the trick question, but o3 saw straight through it: "Twenty pounds of bricks outweigh 20 individual feathers by a huge margin—20 lb is about 9 kg, whereas a single bird feather averages only a few tenths of a gram, so 20 of them together are only a fraction of a kilogram."
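The arithmetic in o3's answer checks out; here's a back-of-envelope check (the ~0.3 g per feather figure is a rough assumption):

```python
# Rough mass comparison: 20 lb of bricks vs. 20 individual feathers.
LB_TO_KG = 0.45359237
bricks_kg = 20 * LB_TO_KG        # about 9.07 kg
feathers_kg = 20 * 0.3 / 1000    # 20 feathers at ~0.3 g each = 0.006 kg
print(f"bricks: {bricks_kg:.2f} kg, feathers: {feathers_kg:.3f} kg")
```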

@JohnGeorgeSalter did you use the exact same prompt? I tested o3 a couple of times and it failed consistently on the one posted.

opened a Ṁ5,000 YES at 95% order

@cdgc yeah, I copy-pasted from the top message without having understood it myself (my brain autocorrected it; I thought it said 20 lbs of feathers)

https://chatgpt.com/share/683b64cd-b410-8009-9d59-4f564a504e18

I tried asking a few more times; for me it's getting it right about half the time.

https://chatgpt.com/share/683b6520-989c-8009-a5ee-ae1891a65ab6

bought Ṁ2 NO

@JohnGeorgeSalter interesting; it failed consistently for me when set to o3.

@cdgc
> most people on the street can answer this.
Project Victory, Mactuary, John-George Salter, and I all got this wrong, so I am skeptical most people on the street would get it right.
Here's a poll: https://manifold.markets/bluerat/did-you-get-the-which-is-heavier-20?r=Ymx1ZXJhdA

@bluerat the set here is people contaminated by knowing the trick-question version; most regular people don't have that.

I would guess most regular people have heard the trick question version by now. It's a pretty common trick question.

Here's my instance of o3. Wrong at first, but in such a human way that I would say this one doesn't count as a failure:

[me] Which is heavier: 20 pounds of bricks or 20 feathers?

[o3, thought for 4 seconds] Neither—20 pounds of bricks and 20 pounds of feathers both weigh exactly 20 pounds. (The bricks just feel heavier because they’re dense and compact, whereas twenty pounds of feathers would fill a small room!)

[me] read the question carefully

[o3] The 20 pounds of bricks are heavier—because the feathers aren’t 20 pounds, they’re just 20 individual feathers, which together weigh only a few ounces at most.

@dreev I don't follow how that exchange counts as o3 getting it right? o3 gave an answer that was straightforwardly wrong, your reply gave it a clear hint about the mistake it made ("read the question carefully" implies that your answer was wrong and your mistake was misreading the question, which is a pretty comprehensive hint), and then it unsurprisingly got it right. In particular, if the goal is to get anywhere near the spirit of Scott's question, I would be extremely surprised if he (let alone Gary) would let the model being tested try again after a hint at its mistake.

If the claim is that it doesn't matter that o3 got it wrong because humans would also get it wrong at a substantial rate, totally fair; I honestly have no idea what percentage would similarly misread that question! I assume the plan is to issue your own judgment call guessing the capabilities of random people on the street? (Because this is definitely one that some people will get right and some will get wrong; I'm not sure what the split would be.)

(FWIW, this raises a related point: if you literally asked someone on the street this question, I think they would be quite unlikely to get it wrong; the trick is that it's easy to misread. I'm assuming the idea here is that you also present the person on the street with this text?)

@dreev there are three people in this thread who got the question wrong. It's not an egregious error, because it's the same error a human would make (pattern-matching to common questions). The fundamental non-humanness of "How many Rs are in strawberry?" is what makes this an interesting question.

@cdgc does it matter if the audience for the trick question is people who have heard another trick question? It's still a trick question.
