Consider the following market from Scott Alexander:
https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a
I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history. I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI but not unmaskable as not-AGI with any single prompt.
Will someone prove me wrong?
FAQ
1. Can the prompt include ascii art?
No, I don't think that's in the spirit of the question.
2. Does it have to commit the egregious error in response to one single prompt?
Yes, I did say that in the initial description and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version so maybe we can get clarification from him on this.
3. What about letter-counting questions?
In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, I'm counting that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer. (See the sketch after the FAQ for a sense of what that code looks like.)
4. What about other questions humans would find tedious and time-consuming?
I think o3 can typically write code that solves such problems, but for simplicity we'll restrict this market to questions that humans can pose and answer out loud.
5. What if o3 errs but corrects itself when questioned?
That's far better than digging itself in ever deeper, but this question is about single prompts. However, if o3 is just misreading the question, in a way that humans commonly do as well, and if o3 understands and corrects the error when it's pointed out, I would not call that an egregious error.
6. What if the error only happens with a certain phrasing of the question?
If the failure depends on a certain exact phrasing, we'll consider it in the same category as the SolidGoldMagikarp exception, as long as the rephrasings don't count as clarifications or otherwise change the question being asked or how difficult it is for humans to answer.
(I didn't settle on this clarification until later, but it turns out to be moot for this market because we've now found a prompt that o3 fails regardless of the exact phrasing. So we're looking at a YES resolution either way.)
7. What if o3 overlooks a detail in the question?
If it's a human-like error and it understands and corrects when the oversight is pointed out, that's not an egregious error.
8. What if there's high variance on how well people off the street perform?
Basically, if we're having to nitpick or agonize over this, then it's not an egregious error. Of course, humans do sometimes make egregious errors themselves, so there's some confusing circularity in the definition here. If we did end up having to pin this down, I think we'd tentatively pick a threshold like "9 out of 10 people sampled literally on the street give a better answer than the AI".
9. Can this market resolve-to-PROB?
In principle, yes. Namely, if we can identify a principle by which to do so.
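To give a concrete sense of FAQ 3: here's a minimal sketch of the kind of throwaway script o3 writes and runs for itself on a letter-counting question. This is my illustration of the approach, not o3's verbatim output.

```python
# Minimal sketch (my illustration, not o3's actual output) of the kind of
# one-off script o3 writes for a letter-counting question like
# "How many r's are in 'strawberry'?"
word = "strawberry"
letter = "r"
count = word.lower().count(letter)
print(f"'{word}' contains {count} occurrence(s) of '{letter}'")  # -> 3
```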
[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]
Update 2025-06-07 (PST) (AI summary of creator comment): If the resolution involves the 'people off the street' test (as referred to in FAQ 8):
- The creator will solicit assistance to select the test question.
- The aim is to choose a question considered most likely to demonstrate superior performance by humans compared to the AI.
- This process is intended to be fair, ensuring humans are not given other unfair advantages (e.g., discussions or clarifications while answering).
Update 2025-06-09 (PST) (AI summary of creator comment): When judging if an AI's error is human-like or indicates deeper confusion:
- If there is doubt, the AI's reaction to the error being pointed out will be used as a helpful tie-breaker to characterize the nature of the initial error. (This relates to FAQ 5 and FAQ 7 regarding error correction.)
Update 2025-06-13 (PST) (AI summary of creator comment): The creator has confirmed that the "people on the street" test (referenced in FAQ 8) is being initiated to help determine the resolution.
The creator is asking the community for a consensus on the best question to use for this test.
Update 2025-06-16 (PST) (AI summary of creator comment): The creator has signaled their intent to resolve the market to YES based on their informal testing of the 'duct tape ham sandwich question.' See the linked comment for the creator's reasoning and a final call for trader feedback.
Update 2025-06-17 (PST) (AI summary of creator comment): In detailing their final reasoning for resolving to YES, the creator has provided the specific chat log of the error and clarified several points on methodology:
- The resolution is based on the original error from the creator's own instance of the AI, even though that instance now gets the question right. The original error is what matters.
- The error has been confirmed to be replicable in a temporary chat.
- The failure on the original prompt is sufficient for a YES resolution, even if adding a clarifying note to the prompt leads to a correct answer.
Update 2025-06-17 (PST) (AI summary of creator comment): In response to a user asking to agree on the correct answer for the 'ham sandwich' test, the creator has provided a sample list of potential answers to illustrate a spectrum of correctness.
This clarifies how the creator is judging what constitutes an 'egregiously wrong' answer for the test case that will determine the market's resolution.
Update 2025-06-17 (PST) (AI summary of creator comment): In response to questions about the 'person on the street' test (referenced in FAQ 8), the creator has clarified their methodology:
- They will not be conducting a literal test by polling random people on a street.
- The creator's resolution will be based on their own informal polling of non-technical people, which they have determined is sufficient.
Update 2025-06-18 (PST) (AI summary of creator comment): To finalize their assessment of the 'person on the street' test, the creator has set up an online survey.
The survey includes a sanity-check question to filter participants.
The creator states this is a final confirmation of their plan to resolve to YES, based on the judgment that the AI's error is egregious compared to human performance.
Update 2025-06-18 (PST) (AI summary of creator comment): Regarding the online survey being used to help determine the resolution:
- In response to concerns about biased results, the creator has committed to only counting organic responses from the survey provider (SurveySwap).
- The creator will remove any responses deemed suspicious.