Consider the following market from Scott Alexander:
https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a
I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history. I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI but not unmaskable as not-AGI with any single prompt.
Will someone prove me wrong?
FAQ
1. Can the prompt include ascii art?
No, I don't think that's in the spirit of the question.
2. Does it have to commit the egregious error in response to one single prompt?
Yes, I did say that in the initial description and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version so maybe we can get clarification from him on this.
3. What about letter-counting questions?
In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, I'm counting that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer.
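(For illustration only: the snippet below is a guess at the kind of counting code o3 writes and runs for these questions; the word and letter are hypothetical examples, not taken from an actual transcript.)

```python
# Illustrative sketch only -- a guess at the kind of counting code o3
# writes and runs for letter-counting questions. The word and letter
# here are hypothetical examples, not from an actual o3 transcript.
word = "strawberry"
letter = "r"

count = word.lower().count(letter)  # count occurrences of the letter
print(f"'{letter}' appears {count} time(s) in '{word}'")  # prints 3
```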
4. What about other questions humans would find tedious and time-consuming?
I think o3 can typically write code that solves such problems, but for simplicity for this market we'll restrict ourselves to questions that can, for humans, be posed and answered out loud.
5. What if o3 errs but corrects itself when questioned?
That's far better than digging itself in ever deeper, but this question is about single prompts. However, if o3 is just misreading the question, in a way that humans commonly do as well, and if o3 understands and corrects the error when it's pointed out, I would not call that an egregious error.
6. What if the error only happens with a certain phrasing of the question?
If the failure depends on one exact phrasing, and rephrasings don't count as clarifications or otherwise change the question being asked or how difficult it is for humans to answer, then we'll consider it in the same category as the SolidGoldMagikarp exception.
(I didn't settle on this clarification until later, but it turns out to be moot for this market because we've now found a prompt o3 fails at regardless of the exact phrasing. So we're looking at a YES resolution either way.)
7. What if o3 overlooks a detail in the question?
If it's a human-like error and it understands and corrects when the oversight is pointed out, that's not an egregious error.
8. What if there's high variance on how well people off the street perform?
Basically, if we're having to nitpick or agonize over this, then it's not an egregious error. Of course, humans do sometimes make egregious errors themselves, so there's some confusing circularity in the definition here. If we did end up having to pin this down, I think tentatively we'd pick a threshold like "9 out of 10 people sampled literally on the street give a better answer than the AI".
9. Can this market resolve-to-PROB?
In principle, yes. Namely, if we can identify a principle by which to do so.
[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]
Update 2025-06-07 (PST) (AI summary of creator comment): If the resolution involves the 'people off the street' test (as referred to in FAQ 8):
The creator will solicit assistance to select the test question.
The aim is to choose a question considered most likely to demonstrate superior performance by humans compared to the AI.
This process is intended to be fair, ensuring humans are not given other unfair advantages (e.g., discussions or clarifications while answering).
Update 2025-06-09 (PST) (AI summary of creator comment): When judging if an AI's error is human-like or indicates deeper confusion:
If there is doubt, the AI's reaction to the error being pointed out will be used as a helpful tie-breaker to characterize the nature of the initial error. (This relates to FAQ5 and FAQ7 regarding error correction).
Update 2025-06-13 (PST) (AI summary of creator comment): The creator has confirmed that the "people on the street" test (referenced in FAQ 8) is being initiated to help determine the resolution.
The creator is asking the community for a consensus on the best question to use for this test.
@SimoneRomeo You mean literally 10 people on the street? Yes. Any consensus on the best question to use?
@JimHays I wonder if chatbots will ever be allowed to take screenshots when they determine they need to perform visual analysis on the prompt.
@Haiku It seems like the range of inputs where that would be helpful is pretty small, but yeah, like this one is dependent on the particular font that is used. With a different font, it could easily go the other way. The answer may depend on things like the device, browser, and browser zoom settings. ChatGPT may well not know what font stack is being used on its website, let alone what font from the font stack is being rendered in this specific conversation.
Though I think a good response to this question would have acknowledged its own limitations, rather than guessing.
@LuluHowell Yeah, all the errors from o3 that I've seen on tricksy questions like that are quite human-like. See also FAQ5. When in doubt about whether the AI has made a human-like error or is more deeply confused, seeing how it reacts to the error being pointed out can be a helpful tie-breaker.
@dreev models don't have a mechanism for being "confused"; these are collections of probabilities, and pointing out an error increases the probability of other responses being generated, since one does not point out errors for correct information.
Explicitly, questioning correct information increases the probability of seeing a hallucination, and questioning hallucinations increases the probability of landing on a correction.
I'm biased, of course, but I think this resolves YES. Either we go with the criterion "egregious errors," we go with the criterion "worse than the average person," or we try to interpret OP very literally:
1. If we go with the criterion "egregious error", then we have seen some examples of that already. The fact that some of the people in this thread skimmed those questions and also made egregious errors is not relevant. If "egregious error" is necessarily defined as being normed to an average person, see below. But again, people can make egregious errors!
2. If we go with the criterion "worse than the average person", then we already have data for the human norm on SimpleBench, which has been publicly posted -- o3 does worse than the average human.
3. If we are very literal and go with the text in OP, "This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street," then we could interpret this to just mean that we need to find a pure-text request that o3 answers worse than A SINGLE person off the street (we only need to find one person off the street whom o3 answers worse than). I think o3 answering worse than EVERY SINGLE person off the street is further from the original intent than A SINGLE person off the street -- if I had known that you could compare to every single person, I would never have bet on this (obviously you can find a person who will do arbitrarily poorly).
@SorenJ note that the data from SimpleBench isn't nearly enough people for a perfect comparison, but also people get ~83%, while o3 gets <55%. That's a big enough gap for even an overly small sample (9 people) to still be definitive.
@SorenJ “we could interpret this to just mean that we need to find a pure-text request that o3 answers worse than A SINGLE person off the street”
This isn’t a math proof, so that’s not what that expression means in this context. It means a hypothetical representative person off the street, which could be operationalized in multiple ways.
@JimHays I said "we could interpret this" after previously discussing the more relevant hypothetical representative person off the street.
@DavidHiggs Yeah, it's a small sample size, but people here are talking about going to the street and taking an N=1 sample for just 1 question, so it's the best we have.
@SorenJ we already discussed and agreed on the resolution criteria a few days ago; you can find them below.
@SimoneRomeo I don't see any agreement; I just see @dreev saying, "If I end up literally asking 10 people off the street, can the NO bettors agree on the best question I should use?"
Poll created:
https://manifold.markets/SimoneRomeo/whats-the-hardest-question-for-o3?r=U2ltb25lUm9tZW8
@dreev please check that all the questions are correct verbatim. Everyone can vote.
@SimoneRomeo I know I said below that I didn't think the marmot/chicken question was a good choice, but I just ran some tests and it actually seems more reliably difficult for o3 than the other suggestions, at least for "my o3". Maybe add it to the poll?
I think it's worth noting that the original question was whether there existed any single prompt, not whether we could predict in advance via poll what it would be. So I'm not crazy about the implication that if something goes wrong with whatever question the poll selected, it would be counted as a NO.
@MugaSofer I thought everyone could edit the poll, but it seems not even I can. If you want to create a new one, feel free to do it, and I can close that.