Will GPT-4 be unreliable at reasoning about the physical, psychological, and mathematical world? (Gary Marcus GPT-4 prediction #2)
Resolved YES (Jul 18)

This market is about prediction #2 from Gary Marcus's predictions for GPT-4. It resolves based on my interpretation of whether that prediction has been met, strongly taking into account arguments from other traders in this market. The full prediction is:

Reasoning about physical, psychological and mathematical world will still be unreliable, GPT-3 was challenged in theory of mind, medical and physical reasoning. GPT-4 will solve many of the individual specific items used in prior benchmarks, but still get tripped up, particularly in longer and more complex scenarios. When queried on medicine it will either resist answering (if there are aggressive guardrails) or occasionally spout plausible-sounding but dangerous nonsense. It will not be trustworthy and complete enough to give reliable medical advice, despite devouring a large fraction of the Internet.


predicted YES

Here's GPT-4 failing to reliably reason about the mathematical world:

https://manifold.markets/dreev/will-an-llm-be-able-to-solve-confus

Reasoning about the physical world seems similar. I gave it a convoluted but I think clearly specified scenario involving nailing down your furniture and turning your house upside-down. It did well for a while and then got itself confused about whether the furniture was nailed to the floor or the ceiling.

If you doubt that this model is much better at math than Gary, in addition to outperforming doctors (as rated by doctors) and at theory of mind, then I have a lot of Harry Potter fanfic, weight-gain formulas, and bad fedoras to sell you.

Note that Gary Marcus believes this market should resolve YES.

https://garymarcus.substack.com/p/gpt-5-and-irrational-exuberance

@IsaacKing meh, he's obviously biased, and he doesn't actually give evidence, just claims victory.

Need evidence of "plausible-sounding but dangerous nonsense" medical answers, for example.

@MartinRandall Yeah, I assumed they'd declare victory for anything short of superintelligence.

@IsaacKing I've not read it (on mobile), but something like this Microsoft paper should answer that.

Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine | NEJM https://www.nejm.org/doi/10.1056/NEJMsr2214184

bought Ṁ100 of YES

Today GPT-4 got 1/3 − 1/2 wrong as part of a general query about two functions A(n) and B(n) at large values of n.

It said the answer is 1/6; I corrected it, it accepted that it was wrong, thanked me for correcting it, and then proceeded to redo the "math", again incorrectly.

This repeated three times.
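For reference, the arithmetic itself is easy to verify mechanically. A minimal Python check (my own illustration; A(n) and B(n) aren't reproduced here since the original query wasn't shared):

```python
from fractions import Fraction

# The subtraction GPT-4 kept getting wrong: 1/3 - 1/2.
result = Fraction(1, 3) - Fraction(1, 2)
print(result)  # -1/6, not the 1/6 GPT-4 claimed
```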

bought Ṁ10 of NO

Roses are red,
Violets are blue,
GPT-4's logic
May not be true.

predicted YES

GPT-4 still seems to be iffy on things like walls vs floors vs ceilings:

Q. Hypothetical: I hammer a nail halfway into the wall, straight rather than at an angle, to hang a picture say. Is the nail perpendicular to the ceiling or parallel with it?

ChatGPT-4: When you hammer a nail straight into the wall (vertically), it is perpendicular to both the ceiling and the floor. It is not parallel to the ceiling because parallel lines run in the same direction without intersecting, while perpendicular lines intersect at a 90-degree angle. In this case, the nail is driven into the wall at a 90-degree angle with respect to the ceiling and floor.
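For what it's worth, the correct answer is easy to check with vectors. A minimal sketch, with a coordinate frame that is my own assumption (z pointing up, nail driven horizontally into a vertical wall):

```python
# Assume z points up: the wall is the x-z plane, the ceiling a horizontal plane.
nail = (0.0, 1.0, 0.0)            # driven straight (horizontally) into the wall
ceiling_normal = (0.0, 0.0, 1.0)  # normal vector of a horizontal ceiling

dot = sum(a * b for a, b in zip(nail, ceiling_normal))

# A line is parallel to a plane iff it is perpendicular to the plane's normal.
print("parallel to ceiling:", dot == 0)  # True: the nail is parallel, not perpendicular
```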

predicted YES

@ML Now that I finally have API access, I can check what GPT-4 (still just the text-only version, unfortunately) answers at temperature zero. Its answer to the above prompt is "The nail is perpendicular to the ceiling." If I argue with it I can eventually get it to admit that this doesn't make sense and come up with the right answer, but it takes effort, and a human would not make a mistake like this in the first place (assuming they knew what the words "parallel" and "perpendicular" mean, which GPT-4 clearly does; none of the errors I observed are due to it not knowing the word meanings).
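For anyone who wants to reproduce this, here's a sketch of the kind of call I mean, assuming the pre-1.0 `openai` Python client that was current at the time (the prompt is the one quoted above):

```python
import openai  # pre-1.0 client interface

openai.api_key = "sk-..."  # your API key

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # temperature zero for (near-)deterministic output
    messages=[{
        "role": "user",
        "content": "Hypothetical: I hammer a nail halfway into the wall, "
                   "straight rather than at an angle, to hang a picture say. "
                   "Is the nail perpendicular to the ceiling or parallel with it?",
    }],
)
print(response.choices[0].message.content)
```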

@IsaacKing are you waiting to see if the image-capable version of GPT-4 does better at these types of problems?

Based on what I've seen so far, I believe this resolves YES. I haven't played around much with GPT-4 myself though.

Any counterarguments?

@IsaacKing Definitely unreliable reasoning about the mathematical world: https://twitter.com/RokoMijic/status/1635850879055085573

Many humans also don't know that there is no largest prime with a prime number of digits, so that example isn't proof of sub-human reliability, but there are other examples.
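(For the curious, the underlying fact: for every digit count d there is a prime with exactly d digits, so primes whose digit count is itself prime get arbitrarily large, and no largest one exists. A quick sympy sketch, my own illustration:)

```python
from sympy import isprime, nextprime

# For each prime digit count d, exhibit a d-digit prime. Since this works for
# every d, there is no largest prime with a prime number of digits.
for d in [2, 3, 5, 7, 11]:
    p = nextprime(10 ** (d - 1))  # smallest prime with exactly d digits
    assert isprime(d) and len(str(p)) == d
    print(d, p)
```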

Do you have examples for unreliable reasoning for:

  • Theory of mind

  • Physical reasoning

  • Psychological reasoning

  • Medical reasoning

Currently I would resolve this anywhere in the range 20-100%, depending on further examples in these areas.

How does this resolve if GPT-4 is reliable in one of these many domains? I read the prediction as saying that GPT-4 will be unreliable in all of these areas, so success in any one of them would falsify the prediction. Correct?

predicted YES

@MartinRandall Yeah, I think it resolves YES only if it is unreliable in all those domains, but I am biased since I hold YES shares.

predicted YES

@MartinRandall I guess biased in the opposite direction actually.

What is the benchmark for giving medical advice? E.g., what if the medical advice is of comparable quality to that of a typical medical professional working from the same diagnostic input, namely text-based patient reports?

I ask because medical professionals have also been known to spout plausible-sounding but dangerous nonsense. Many covid-related examples come to mind.

@MartinRandall If it's at least as good as typical human doctors, that's sufficient. (Not counting fake "doctors" like people with a degree in chiropractic.)

bought Ṁ10 of YES

@IsaacKing General practitioners, cardiologists, epidemiologists, orthopedic surgeons, etc.? Which ones? And what about specifically designed existing medical AI software that is not an LLM? LLMs do not excel at spatial reasoning, but if you go to a radiologist, they are going to get a huge amount of computer-assisted input, though not from an LLM and definitely not GPT-Anything. This is standard practice in radiology.

Will GPT-3.5 be at least as smart as Gary Marcus by 2023?

Result: YES; it exhibits higher cognitive fluidity, better writing, and more insight.

Suggestion: let's deprecate this language model and upgrade to Rodney Brooks, who at least makes intelligent predictions.

predicted YES

@Gigacasting GPTs may be stochastic parrots, but unlike Gary Marcus, they at least parrot sources other than themselves!
