Will GPT-4 be unreliable at reasoning about the physical, psychological, and mathematical world? (Gary Marcus GPT-4 prediction #2)
Resolved YES (Jul 18)

This market is about prediction #2 from Gary Marcus's predictions for GPT-4. It resolves based on my interpretation of whether that prediction has been met, strongly taking into account arguments from other traders in this market. The full prediction is:

Reasoning about physical, psychological and mathematical world will still be unreliable, GPT-3 was challenged in theory of mind, medical and physical reasoning. GPT-4 will solve many of the individual specific items used in prior benchmarks, but still get tripped up, particularly in longer and more complex scenarios. When queried on medicine it will either resist answering (if there are aggressive guardrails) or occasionally spout plausible-sounding but dangerous nonsense. It will not be trustworthy and complete enough to give reliable medical advice, despite devouring a large fraction of the Internet.


predicted YES

Here's GPT-4 failing to reliably reason about the mathematical world:

https://manifold.markets/dreev/will-an-llm-be-able-to-solve-confus

Reasoning about the physical world seems similar. I gave it a convoluted but I think clearly specified scenario involving nailing down your furniture and turning your house upside-down. It did well for a while and then got itself confused about whether the furniture was nailed to the floor or the ceiling.

If you doubt that this model is much better at math than Gary, in addition to outperforming doctors (as rated by doctors) and at theory of mind, then I have a lot of Harry Potter fanfic, weight-gain formulas, and bad fedoras to sell you.

Note that Gary Marcus believes this market should resolve YES.

https://garymarcus.substack.com/p/gpt-5-and-irrational-exuberance

@IsaacKing meh, he's obviously biased, and he doesn't actually give evidence, just claims victory.

Need evidence of "plausible-sounding but dangerous nonsense" medical answers, for example.

@MartinRandall Yeah, I assumed they'd declare victory for anything short of superintelligence.

@IsaacKing I've not read it (on mobile), but something like this Microsoft paper should answer that.

Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine | NEJM https://www.nejm.org/doi/10.1056/NEJMsr2214184

bought Ṁ100 of YES

Today GPT-4 got 1/3 − 1/2 wrong as part of a general query about two functions A(n) and B(n) at large values of n.

It said the answer is 1/6; I corrected it, it accepted that it was wrong, thanked me for correcting it, and then proceeded to redo the "math", again incorrectly.

This repeated three times.
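For reference, the arithmetic itself is easy to verify mechanically. A minimal Python check (my own illustration; A(n) and B(n) aren't reproduced here since the original query wasn't shared):

```python
from fractions import Fraction

# The subtraction GPT-4 kept getting wrong: 1/3 - 1/2.
result = Fraction(1, 3) - Fraction(1, 2)
print(result)  # -1/6, not the 1/6 GPT-4 claimed
```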

bought Ṁ10 of NO

Roses are red,
Violets are blue,
GPT-4's logic
May not be true.

predicted YES

GPT-4 still seems to be iffy on things like walls vs floors vs ceilings:

Q. Hypothetical: I hammer a nail halfway into the wall, straight rather than at an angle, to hang a picture say. Is the nail perpendicular to the ceiling or parallel with it?

ChatGPT-4: When you hammer a nail straight into the wall (vertically), it is perpendicular to both the ceiling and the floor. It is not parallel to the ceiling because parallel lines run in the same direction without intersecting, while perpendicular lines intersect at a 90-degree angle. In this case, the nail is driven into the wall at a 90-degree angle with respect to the ceiling and floor.
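For what it's worth, the correct answer is easy to check with vectors. A minimal sketch, with a coordinate frame that is my own assumption (z pointing up, nail driven horizontally into a vertical wall):

```python
# Assume z points up: the wall is the x-z plane, the ceiling a horizontal plane.
nail = (0.0, 1.0, 0.0)            # driven straight (horizontally) into the wall
ceiling_normal = (0.0, 0.0, 1.0)  # normal vector of a horizontal ceiling

dot = sum(a * b for a, b in zip(nail, ceiling_normal))

# A line is parallel to a plane iff it is perpendicular to the plane's normal.
print("parallel to ceiling:", dot == 0)  # True: the nail is parallel, not perpendicular
```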

predicted YES

@ML Now that I finally have API access, I can check what GPT-4 (still just the text-only version, unfortunately) answers at temperature zero. Its answer to the above prompt is "The nail is perpendicular to the ceiling." If I argue with it I can eventually get it to admit that this doesn't make sense and come up with the right answer, but it takes effort, and a human would not make a mistake like this in the first place (assuming they knew what the words "parallel" and "perpendicular" mean, which GPT-4 clearly does; none of the errors I observed are due to it not knowing the word meanings).
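For anyone who wants to reproduce this, here's a sketch of the kind of call I mean, assuming the pre-1.0 `openai` Python client that was current at the time (the prompt is the one quoted above):

```python
import openai  # pre-1.0 client interface

openai.api_key = "sk-..."  # your API key

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # temperature zero for (near-)deterministic output
    messages=[{
        "role": "user",
        "content": "Hypothetical: I hammer a nail halfway into the wall, "
                   "straight rather than at an angle, to hang a picture say. "
                   "Is the nail perpendicular to the ceiling or parallel with it?",
    }],
)
print(response.choices[0].message.content)
```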

@IsaacKing are you waiting to see if the image-capable version of GPT-4 does better at these types of problems?

Based on what I've seen so far, I believe this resolves YES. I haven't played around much with GPT-4 myself though.

Any counterarguments?

@IsaacKing Definitely unreliable reasoning about the mathematical world: https://twitter.com/RokoMijic/status/1635850879055085573

Many humans also don't know that there is no largest prime with a prime number of digits, so that example isn't proof of sub-human reliability, but there are other examples.
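(For the curious, the underlying fact: for every digit count d there is a prime with exactly d digits, so primes whose digit count is itself prime get arbitrarily large, and no largest one exists. A quick sympy sketch, my own illustration:)

```python
from sympy import isprime, nextprime

# For each prime digit count d, exhibit a d-digit prime. Since this works for
# every d, there is no largest prime with a prime number of digits.
for d in [2, 3, 5, 7, 11]:
    p = nextprime(10 ** (d - 1))  # smallest prime with exactly d digits
    assert isprime(d) and len(str(p)) == d
    print(d, p)
```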

Do you have examples for unreliable reasoning for:

  • Theory of mind

  • Physical reasoning

  • Psychological reasoning

  • Medical reasoning

Currently I would resolve this anywhere in the range 20-100%, depending on further examples in these areas.

How does this resolve if GPT-4 is reliable in one of these many domains? I read the prediction as saying that GPT-4 will be unreliable in all of these areas, so success in any one of them would falsify the prediction. Correct?

predicted YES

@MartinRandall Yeah, I think it resolves YES only if it is unreliable in all those domains, but I am biased since I hold YES shares.

predicted YES

@MartinRandall I guess biased in the opposite direction actually.

What is the benchmark for giving medical advice? E.g., what if the medical advice is of comparable quality to that of a typical medical professional working from the same diagnostic input, namely text-based patient reports?

I ask because medical professionals have also been known to spout plausible-sounding but dangerous nonsense. Many covid-related examples come to mind.

@MartinRandall If it's at least as good as typical human doctors, that's sufficient. (Not counting fake "doctors" like people with a degree in chiropractic.)

bought Ṁ10 of YES

@IsaacKing General practitioners, cardiologists, epidemiologists, orthopedic surgeons, etc.? Which ones? And what about specifically designed existing medical AI software that is not an LLM? LLMs do not excel at spatial reasoning, but if you go to a radiologist, they are going to get a huge amount of computer-assisted input, though not from an LLM and definitely not GPT-Anything. This is standard practice in radiology.

Will GPT-3.5 be at least as smart as Gary Marcus by 2023?

Result: YES; it exhibits higher cognitive fluidity, better writing, and more insight.

Suggestion: let's deprecate this language model and upgrade to Rodney Brooks, who at least makes intelligent predictions.

predicted YES

@Gigacasting GPTs may be stochastic parrots, but unlike Gary Marcus, they at least parrot sources other than themselves!
