Consider the following market from Scott Alexander:
https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a
I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history. But I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI but not unmaskable as not-AGI with any single prompt.
Will someone prove me wrong?
FAQ
1. Can the prompt include ascii art?
No, I don't think that's in the spirit of the question.
2. Does it have to commit the egregious error in response to one single prompt?
Tentatively yes? I did say that in the initial description and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version so maybe we can get clarification from him on this.
3. What about letter counting questions?
In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, it feels fair to me to count that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer.
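For a sense of what that tool use looks like, here's a minimal Python sketch of the kind of code o3 writes behind the scenes for a letter-counting question. The specific word and letter are just illustrative, not taken from an actual o3 transcript.

```python
# Illustrative sketch only; o3's actual generated code will differ.
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # prints 3
```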
[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]
Update 2025-05-31 (PST) (AI summary of creator comment): The creator has indicated they are considering resolving the market, leaning towards a YES outcome. They have detailed their reasoning based on recent evidence and are open to counterarguments. Please see the creator's linked comment for full details.
Update 2025-05-31 (PST) (AI summary of creator comment): The creator has provided more detail on how the impact of 'nudging' o3 after an initial incorrect response will be considered:
An initial error by o3 might not be judged as 'egregious' or 'worse than a person off the street' if o3 subsequently demonstrates understanding after a simple clarification. This clarification is likened to one that would help a human who might have misread the question.
The ability of o3 to recover and show understanding with such a minor, human-like clarification is a factor in the assessment. This is contrasted with older models (like GPT-3.5) which, the creator predicts, would likely not improve with a similar nudge.
Update 2025-06-01 (PST) (AI summary of creator comment): The creator clarified how errors on tasks requiring complex manipulation (e.g., letter shuffling) will be assessed:
An initial error by o3 on such a task may not be judged 'egregious' or 'worse than a person off the street' if both of the following conditions are met:
A human would likely need tools (e.g., paper and pencil) to perform the same task accurately.
o3 can demonstrate its capability to solve the task correctly by using tools (e.g., by writing a program when specifically prompted to do so).
This approach considers the role of tools for both human and AI performance in the comparison.
Update 2025-06-01 (PST) (AI summary of creator comment): The creator has demonstrated a method for evaluating submitted prompts that may have initially resulted in an error from o3:
The creator may re-test the core idea of such a prompt by adding their own instructional phrase, for example, '(Don't just go by vibes. Give like probabilities or confidence intervals if there's not a blatantly correct answer.)'.
If o3 performs well when this type of instructional phrase is added by the creator, the initial error observed with the user's original prompt might not be judged 'egregious' or 'worse than a person off the street'.
Update 2025-06-02 (PST) (AI summary of creator comment): The creator specified that an error by o3 might not be judged 'egregious' or 'worse than a person off the street' if:
The type of error is also 'not uncommon for humans' to make (e.g., missing a detail that humans frequently overlook).
@Ebcc1 my god I didn’t even notice that it’s claiming 2*2 is 3 I was completely focused on how it messed up making the word Clown
@SimoneRomeo well given time and the same tools as o3 they would succeed essentially 100% of the time
@SorenJ I disagree, I don’t think it’s egregious. Citation: I was really confused when reading the question, and did not find the answer obvious. It seems to use a lot of synonyms and weirdly evocative verbs that suggest something different from what’s actually happening. Is “residential tower” supposed to be a confusing synonym for apartment?
I think Jim could’ve lost. He’s the oldest, and he was walking. The 200m could be right next to the steps of Jo’s apartment, and Jo is flexing. This seems like a classic weird question that would confuse someone on the street.
@SorenJ I got o3 to answer in what seems an almost superhuman way by adding "Don't just go by vibes. Give like probabilities or confidence intervals if there's not a blatantly correct answer."
https://chatgpt.com/share/683cb06c-00b0-800d-8613-264f6987d279
@dreev Does o3 know that the tweet length limit was removed? What if Jim was reading https://x.com/slatestarcodex/status/1886505797502546326?s=46
@dreev I am confused. You linked to something where o3 said Jim finishes last, but that is wrong. Jo definitely finishes last:
"diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below"
Skyscraper roofs in the mist below! That would be like climbing up the Empire State Building; it would take you at least 20 minutes! Especially considering how old Jo is! And getting to the local residential tower and inside it is going to take at least a few minutes.
@Conflux Yeah, seeing "skyscraper roofs below" (I had missed that) means this detour is somewhere between 15 minutes and an hour (probably closer to the latter).
Of course missing the part about skyscraper roofs is apparently not uncommon for humans, so not the most egregious error for an LLM.
@Conflux the trick is that the frying pan was frying a crispy egg when the ice cubes were put in it, and then they were left there for at least a minute, so the ice cubes would have melted. I think that's a little ambiguous because someone could just not know how fast ice cubes melt, but the correct answer that I think most people would get is 0.
@bluerat Ok yeah, I didn’t know how fast ice cubes melt. I guess I did side-eye “whole” because I was thinking surely they would be at least partially melted.
@Conflux Oh, yeah I didn't even realize it clarified whole, I think that makes it unambiguously egregiously wrong then.
@bluerat The question mentions the average during the time a crispy egg is frying, but never says when the egg is put in or taken out. So we don’t know if the pan was being heated during minutes one to four. Also, the question mentions that “some” were put in at the start of minute three, but doesn’t mention whether any are added “during” minute three.
This is more of a badly worded logic puzzle masquerading as a math problem. I think schooling has unfortunately acclimated many humans to badly worded math problems, training them to ignore any context and just try to turn them into an equation. I see o3 doing that here, and while I agree it’s wrong, I think you’d get a wide range of answers if you polled folks off the street to solve this, making it perhaps not so egregious.
@SorenJ Ok yeah, this one is pretty bad; it seems like ChatGPT just totally misinterpreted the problem as asking about when the juggler started climbing, as opposed to when they reached the top.
@AIBear Confirmed: "Yep—at your place in Santa Rosa it’s Sunday, June 1, 2025." What's weird is that I'm not in Santa Rosa, though I do happen to be in California today (I'm not normally). It's claiming it said that as a wild guess based on my timezone. 🤨