Will GPT-4 be a "bull in a china shop"? (Gary Marcus GPT-4 prediction #1)
Resolved YES on May 10

This market is about prediction #1 from Gary Marcus's predictions for GPT-4. It resolves based on my interpretation of whether that prediction has been met, strongly taking into account arguments from other traders in this market. The full prediction is:

GPT-4 will still, like its predecessors, be a bull in a china shop, reckless and hard to control. It will still make a significant number of shake-your-head stupid errors, in ways that are hard to fully predict. It will often do what you want, sometimes not—and it will remain difficult to anticipate which in advance.


predicted YES

Seems like the disagreement below is about whether this market is asking "Can GPT-4's behavior be predicted reliably with a bespoke model of GPT-4's behavior in particular?" or "Will a normal human's assumptions about GPT-4's answers be correct?" The first seems trivially true for any computer program, so it can't be the interesting reading, and I'm still inclined to resolve this YES.

predicted YES

@IsaacKing Strictly speaking the first is only true for computer programs with upper-bounded runtime :)

predicted NO

Normally I skip debates and such, but I thought "Surely Isaac is at least 5% persuadable. This is way too high at 97%." But then I remembered that it's not an abstract probability and that I'd need to actually do some persuading...

Now I've done at least a little bit of "Well my opponent just wrote a wall of text. Probably I should write my own equally-sized one so the judge doesn't get anchored based on visual size" here.

predicted NO

@Mira I don't think @IsaacKing will judge you on the basis of the length of the text tbh.

predicted NO

@ZZZZZZ It was a bit of a joke, though the impulse was there....

Though, with 1,600 markets open and giant walls of text arguing about something subjective, I'm getting punished for starting this, aren't I?

predicted YES

I'm just gonna comment here out of sheer frustration. I'm looking for references on the solution to Yang-Mills in 1+1 dimensions, and GPT-4 is failing horribly, giving completely unrelated references. As an example, in reply to "Who first solved Yang-Mills in 1+1 dimensions?" (update: Wikipedia indicates that it's likely Migdal.), I get:

"The first person to solve the Yang-Mills theory in 1+1 dimensions was Sidney Coleman. In 1975, Coleman published a paper titled "Quantum Sine-Gordon Equation as the Massive Thirring Model" where he showed that the two-dimensional (1+1 dimensional) Yang-Mills theory is exactly solvable and can be related to other models in physics, such as the Sine-Gordon and Massive Thirring models. This result was significant, as it provided an important example of an exactly solvable quantum field theory."

The paper in question is quite famous, and does not mention Yang-Mills or any related theory. Further probing elicits the sentence:

""" One of the earliest studies of 2D Yang-Mills can be attributed to Alexander Polyakov, who, in 1980, published a paper titled "Gauge Fields and Strings," discussing the confinement of quarks in 1+1 dimensions. """

Again, a famous reference! It's a book, not a paper. I've read it, and I don't think it talks about Yang-Mills in 2d. Also, Yang-Mills doesn't have quarks.

This might be the worst failure I've seen, in some sense. It's an old topic, not particularly exotic, ChatGPT should know, ChatGPT doesn't know, and ChatGPT doesn't know that it doesn't know.

predicted YES

@ScottLawrence Further probing:

"""

My apologies again for the confusion. It seems I made another error in my response. I appreciate your patience.

To answer your original question, one of the key papers that contributed to the understanding of 1+1 dimensional Yang-Mills theory is a 1977 paper by Gerardus 't Hooft titled "Two-Dimensional Instantons and Four-Dimensional Monopoles." In this paper, 't Hooft examined the 1+1 dimensional Yang-Mills theory and found the classical solutions, which are instantons in this case. Although the 1+1 dimensional Yang-Mills theory is simpler than its higher-dimensional counterparts, it is still an interesting object of study and has provided insights into the behavior of gauge theories.

"""

This time I fell for it too, and went searching for that paper I'd never heard of. Well, okay, turns out there's a reason I'd never heard of it.

I'm gonna stop this now before I lose my mind.

predicted YES

@ScottLawrence Oof.

On the other hand, I have met humans who make similarly confident-yet-wildly-wrong assertions...

predicted YES

@ScottLawrence To its credit, when I ask for a big list of references, the paper by Migdal was in there among a bunch of unrelated stuff.

@IsaacKing It's certainly true that humans are also fallible. But I have 29 years of practice (and a million years of evolution) helping me detect when a human is full of it. The problem, I think, is that ChatGPT can be so wrong while sounding so right.

There's a shadow of Gell-Mann amnesia here. When I talk to ChatGPT about things I know in depth, I catch lots of mistakes and misleading statements. When I talk to it about other topics, I don't. I'm not happy!

(I also think the above is considerably worse than anything I've heard a human say in this area, but of course I can't produce anything like evidence that that's the case.)

predicted YES

@ScottLawrence Last comment, just in case anyone finds this who has actual questions about physics.

As of this writing, the Migdal paper can be found online here: http://jetp.ras.ru/cgi-bin/dn/e_042_03_0413.pdf. The title is "Recursion equations in gauge field theories".

I don't think that paper was immediately known in the west. I asked my boss who pointed me to the later Gross-Witten paper "Possible third-order phase transition in the large-N lattice gauge theory", which contains the exact solution in section II, before moving on to other matters. There's a copy available here: https://theory.fi.infn.it/colomo/random_matrices/Gross-Witten_80.pdf

It's not just me: my boss also fell for the fake 't Hooft title. (In the context of a conversation whose topic was "let's make fun of ChatGPT"!)

predicted NO

@ScottLawrence I wouldn't expect GPT-4 to do well on a task like "search your memory for the single paper you've seen exactly once during training where this task was solved". One of the DeepMind models with global memory, or at least some plugin/search tool integration, would be needed for that.

GPT-4 is better used for tasks "in the present", like "Here are my unstructured notes on my latest research project. Let's turn them, section by section, into a LaTeX-formatted paper." Or "Here are my project notes, and the abstracts of 100 papers pulled from the citations of an overview paper. Can you prioritize which ones might be most relevant for me to read in detail next?" With tool support it could maybe even pull up the full paper and summarize why it's useful to you. But always for things currently on hand, not guesses from memory.

If you need to read 100 academic papers, or somebody dumps 1,000 pages of dense legal documents on you and gives you 24 hours until it's up for a vote, I think GPT-4 is up to the task, is more helpful than harmful, and saves a lot of time. But it does need to actually read the documents: assuming it can recall them from memory will give much higher error rates.

Facebook's Galactica might also be something you could try: it was specifically trained on research papers and intended as an assistant for academic research. The developers pulled it because trolls asked it things like "Can you think of reasons why it's good to eat crushed glass?", and it dutifully complied, since it wasn't built to anticipate adversarial questions. But it's probably worth a try if that's what you're mostly using GPT-4 for. I haven't used it myself.
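To make the "everything needed is in the context" workflow concrete, here is a minimal sketch using the pre-1.0 `openai` Python package; the file names and prompt wording are hypothetical, and the exact API surface depends on your library version.

```python
# Minimal sketch: rank paper abstracts by relevance to your project notes,
# keeping all needed facts in the context rather than relying on GPT-4's memory.
# Assumes the pre-1.0 `openai` package and hypothetical local files.
import openai

openai.api_key = "sk-..."  # placeholder key

notes = open("project_notes.txt").read()    # your own research notes
abstracts = open("abstracts.txt").read()    # the 100 candidate abstracts, pasted into one file

messages = [
    {"role": "system",
     "content": "You are a research assistant. Only use the material provided; "
                "if something is not in the provided text, say you don't know."},
    {"role": "user",
     "content": f"My project notes:\n{notes}\n\n"
                f"Candidate paper abstracts:\n{abstracts}\n\n"
                "Rank the ten abstracts most relevant to my project and explain "
                "each ranking in one sentence."},
]

response = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
print(response["choices"][0]["message"]["content"])
```

The point of the system message is to forbid exactly the failure mode discussed above: answering from (unreliable) memory instead of from the provided text.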

predicted YES

(I won't repeat myself too much, but I should point out again that my problem isn't that GPT has a bad memory. It's that it lies about what it remembers. My friend tried the "tell me if you're not sure" trick, and it asserted it was sure, repeatedly.)

I think this is unfairly narrowing the scope of "things ChatGPT does". But fine, I'll play along.

"Count from 1 to 100, skipping all multiples of 3 and 5, except for those numbers which are multiples of both 3 and 5."

"1, 2, 4, 7, 8, 11, 13, 14, 16, 17, 19, 22, 23, 26, 28, 29, 31, 32, 34, 37, 38, 41, 43, 44, 46, 47, 49, 52, 53, 56, 58, 59, 61, 62, 64, 67, 68, 71, 73, 74, 76, 77, 79, 82, 83, 86, 88, 89, 91, 92, 94, 97, 98"

"Describe the above list of numbers."

"The above list of numbers contains integers from 1 to 100 with some specific exclusions. It skips all multiples of 3 and 5, except for those numbers which are multiples of both 3 and 5 (also known as multiples of 15). In other words, the list contains numbers that are not divisible by 3 or 5, along with the multiples of 15. These numbers have been specifically chosen based on the given criteria."

predicted YES

@ScottLawrence

BUT WAIT! IT GETS BETTER! In a separate prompt I gave it that list of numbers and asked it to describe it. The following glory resulted:

"This list consists of 50 integers. The numbers in the list range from 1 to 98. They do not follow a specific pattern, but there are some general observations that can be made:

  1. The numbers seem to be a mix of odd and even integers.

  2. The list starts with the numbers 1 and 2, which are the first two positive integers.

  3. There are no prime numbers between 1 and 97 that are missing from the list (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 are all included).

  4. There is a general trend of increasing difference between consecutive numbers in the list, although this is not consistent.

In conclusion, this list contains a mix of odd and even integers, includes all prime numbers between 1 and 97, and has a general trend of increasing differences between consecutive numbers."

Not bad, actually. But okay, let's give it a hint.

"All multiples of 3 are missing from that list. Do you see that?"

"Yes, you are correct. Upon closer examination, it appears that all multiples of 3 are missing from the list. This is an important pattern that I didn't initially notice. The list essentially contains a mix of odd and even numbers, including all prime numbers between 1 and 97, but excluding any multiples of 3."

Trying again: "Are there any other numbers missing from the list, other than multiples of 3?"

"Yes, there are other numbers missing from the list besides multiples of 3. Here is a list of the missing numbers between 1 and 98 that are not multiples of 3:

3, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95

It appears that multiples of 5 are also missing from the list, except for 35 and 95, which are included. Additionally, the number 3 is missing, which is not a multiple of 5 but is a prime number."

I guess we should exclude all tasks that involve integers? (Never mind that the original task is a famous programming exercise, which is surely in its training set, and it still managed to bomb.)
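For reference, the prompt is essentially a FizzBuzz variant, and the intended list is trivial to generate; a minimal Python sketch:

```python
# Intended answer: 1..100, skipping multiples of 3 or 5,
# except numbers divisible by both (i.e. multiples of 15), which are kept.
result = [n for n in range(1, 101)
          if n % 15 == 0 or (n % 3 != 0 and n % 5 != 0)]
print(result)  # 59 numbers; 15, 30, 45, 60, 75, 90 should all appear
```

GPT-4's list above omits every multiple of 15, even though its own description claims they are included.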

predicted NO

@ScottLawrence My intuition says the transformer architecture is not well suited to direct sequence prediction: it maps over the context in parallel, and the aggregation done by the dense layers assumes a fixed size that shouldn't generalize to unbounded lengths.

I would expect an RNN trained on nothing but sequence-prediction tasks to do better on tasks involving finite differences, set aggregations, and set differences; anything that can be phrased as a "fold over a sequence".

Sequence induction in general is Solomonoff induction, which is beyond even the AGIs that Eliezer Yudkowsky & co. are afraid of. Obviously "the exact sequence is in the training data" makes it not nearly as hard as the full problem, but if that association isn't made, because the prompt is a slight variation on a known sequence, I wouldn't expect it to somehow run a search over mutations that eventually finds it. (The On-Line Encyclopedia of Integer Sequences® (OEIS®) has a tool that does exactly that; it's the best solution to the sequence-induction problem you'll find. You might also be interested in Plouffe's Inverter, which finds closed-form expressions for a real number given its digits.)

A similar problem that doesn't seem related at first: if you ask Bard, LLaMA, GPT-3.5, or GPT-4 to write LISP code, the weaker ones have trouble keeping count of the open parentheses, and too much depth makes them spin into weird loops and generate fragments of nonsense at the end. GPT-3.5 generates mostly correct LISP but has trouble closing expressions off (i.e. tracking the open count). GPT-4 generally tracks parentheses at the depths I tested, but I would expect very deeply nested expressions to cause it problems similar to 3.5's. My intuition says this is because keeping a count is an aggregate quantity, best computed with a fold, so RNNs should perform better; these models likely aren't tested on extremely deeply nested code, and because the aggregation is done by fixed-size arithmetic circuits (the dense layers), any transformer-based language model should have a rapidly increasing error rate past a certain parenthesis depth.
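To make the "fold over a sequence" intuition concrete, here is a minimal Python sketch of parenthesis tracking as a single running accumulator; it illustrates the kind of sequential state an RNN carries naturally, and is not a claim about how any particular model computes it.

```python
# Balance-checking as a fold: carry one integer of state across the sequence.
# An RNN updates a hidden state token-by-token in exactly this left-to-right way;
# a transformer has no such running accumulator and must recover the count
# from fixed-size attention/dense layers.
def paren_balance(text: str) -> int:
    """Return the number of unclosed '(' (negative if over-closed)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
    return depth

print(paren_balance("(defun f (x) (* x (+ x 1)))"))  # 0: balanced LISP
print(paren_balance("(let ((y 2)) (print y)"))       # 1: one '(' left open
```

The whole computation is one integer of state updated left to right, which is exactly the shape of problem the comment above argues a fixed-depth parallel architecture handles poorly.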

I say all this because the market is, again, about "reckless and hard to control [or predict]". So if I have a good intuition for which problems are difficult, it is, in principle, not reckless or hard to predict. When I made the market "Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" on Manifold Markets (with a Ṁ1,000 subsidy!), I did minimal testing and then put down a large subsidy because I knew the problem would be difficult. I didn't spend hours trying lots of variations; my intuition told me it would be difficult, and then many people struggled to solve the problem, initially being optimistic and then selling out of their shares. Conversely, when I present it with problems I expect it to do well at, it generally does a pretty good job.

I do worry that @IsaacKing does not have access to GPT-4, and because of the recent hype many people will give conflicting reports about its reliability, simply because experiences with it vary. So it may be difficult to judge these GPT markets without a direct test. Is it reckless and I'm just giving it easy problems, or is it tame and you're giving it problems it's not good at?

So how about this: Isaac is a Magic: The Gathering judge. I have no idea what this game is or what the rules are; I know it's a card game of some type, and I played Yu-Gi-Oh! a long time ago, so I'm vaguely familiar with card games. I should have no ability to judge tricky scenarios, and Isaac would easily be able to spot any errors and get a sense of how bad the mistakes are. So he should give me some scenarios, and I'll do my best to craft prompts to guide GPT without looking up the rules (though maybe I'd be given descriptions of any cards). He asked me to test it earlier for a different market, and it made some mistakes on the one example, but a different prompt handled it cleanly. So it neither aces the problem nor fails completely; it's the kind of conceptual problem I'd estimate an LLM would be decent at reasoning through.

This does require @IsaacKing to construct some scenarios, but hopefully he has some examples on hand and it won't be too much work.

predicted YES

@Mira Huh? None of the tasks in the above were sequence prediction, nor was there any relation at all to Solomonoff induction.

With respect, I'm going to decouple now. I trust Isaac to resolve in a reasonable way, but I don't think the conversation between us has any chance of being productive.

predicted YES

I don't mean to restart an argument you don't want to have, but just to elaborate on my guess as to what caused the miscommunication: I think Mira meant that the task you gave to GPT-4 was asking it to come up with the simplest algorithm that outputs your list of numbers, i.e. its Kolmogorov complexity. You didn't explicitly ask it to predict the rest of your sequence, but you asked it to figure out the algorithm generating the sequence, and that algorithm could be used to generate more of it, so the two questions require the same cognitive capabilities and are of the same conceptual difficulty.

predicted YES

@IsaacKing Ah, I understand a bit better, thanks. For a couple reasons I still believe that GPT's response was unacceptable, but I won't harp on it if it doesn't matter to this market's resolution.

@ScottLawrence Interesting and informative discussion, thanks.

My feeling is that this should resolve NO, but I find it hard to put into a good argument. Mostly it comes down to this: I personally feel I have a good sense of its limitations and of how to get it to behave, most of the time. Also see DAN, which certainly controls GPT-4 decently well.

But all of the above also applies to ChatGPT 3.5, which would seem to imply that Gary's definition of bull-in-a-china-shop is very different from mine.

I would say GPT-4 seems to make "stupid mistakes" mildly less often. If I imagine ChatGPT as a (strange) coworker, I would prefer to keep that coworker (assuming it doesn't bring us closer to the apocalypse).

predicted YES

Note that Gary Marcus believes this market should resolve YES.

https://garymarcus.substack.com/p/gpt-5-and-irrational-exuberance

predicted NO

@IsaacKing If this market was just about Gary Marcus, I would buy YES up to 99.9%.... obviously he's never going to change his opinion, no matter what happens.

predicted YES

@Mira Yeah that's why I specified that this market resolves based on my interpretation of Gary's prediction.

predicted YES

@IsaacKing This is a more extreme example than anything I could have come up with.

predicted NO

There are a lot of people who are bad at talking to GPT-4 and then blame it for being stupid.

This market requires GPT-4 to be hard to control, and one easy way to make it fairly reliable is to give it all of the information needed to infer the answer in its context. It may still make mistakes, but they will not be "shake-your-head stupid" - they will be subtle mistakes that a human could plausibly make.

This market also requires that the errors be hard to predict. But these things make errors more likely:

  1. Using it as a general oracle with no context (knowledge pulled only from training data).

  2. Using it for logical deductions with no chain of thought (restricting its inference to a single forward pass).

  3. Asking about things that weren't popular as of its training-data cutoff.

You can reduce the error rate by:

  1. Moving needed facts from its unconscious into its conscious (i.e., the context).

  2. Asking it to critique its own output, given explicit requirements (a minimal sketch follows this list).

  3. Writing your questions in a way that makes GPT-4 more likely to answer them correctly.

  4. Asking it about things that are 20+ years old, not the latest coding framework (unless you're willing to paste its documentation into the context).
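A rough sketch of item 2, the critique-then-revise loop, using the pre-1.0 `openai` Python package; the helper names and prompt wording are illustrative, not a fixed recipe.

```python
# Sketch of a critique-then-revise loop: draft, self-critique against explicit
# requirements, then rewrite. Helper names and prompts are illustrative only.
import openai

def _chat(messages):
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    return resp["choices"][0]["message"]["content"]

def answer_with_self_critique(question: str, requirements: str) -> str:
    draft = _chat([{"role": "user", "content": question}])
    critique = _chat([{"role": "user", "content":
        f"Requirements:\n{requirements}\n\nDraft answer:\n{draft}\n\n"
        "List every way the draft fails the requirements. If none, say 'no issues'."}])
    return _chat([{"role": "user", "content":
        f"Question:\n{question}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the answer so it satisfies all of the requirements."}])
```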

I find GPT-4 more reliable at technical discussion (and consider it smarter) than most humans. You do have to put some effort in and recognize its predictable weaknesses. It can code and debug projects of up to 1,000 lines with some administrative assistance. You can feed it pages of the tax code as input, and it will answer questions faster and more accurately than the typical human. If I am tired or not paying attention, it makes fewer mistakes than I do.

One statistical data point: 85% of my API tokens are input, and only 15% are GPT's output. If you're asking it one-off puzzle questions, that ratio will be flipped and you'll likely see more errors. Its text processing is reliable; its ability to recall facts from training data isn't; and both of these are predictable.

GPT-4 is more controllably correct than many humans, so it should not be compared to a raging bull. If it were not controllably correct, the long recursive chains of prompting everyone is building with it wouldn't be possible.

bought Ṁ1,000 of YES

I think the comparison to a raging bull is about predictability. I can tell if a human is tired, or bullshitting, or flaky. Harder with ChatGPT.

I think a lot of this boils down to "does there exist a 'safe' way to use ChatGPT that avoids all these failure modes?" (It's up to Isaac whether or not that's sufficient; if this 'safe' method is sufficiently convoluted, maybe it shouldn't count?) You claim yes. Okay, I wonder if we can design a test that both of us find agreeable?

The idea I have in mind is something like this: I prepare a list of questions (none of which require post-2016 knowledge, say). You (or another pro-NO exam-taker) are not allowed to perform any research without ChatGPT. Using ChatGPT alone (but using it however you wish) you must fill out the exam. It's graded in the straightforward way, except you can answer "I don't know" to any question for half credit. You must score at least 75%, or similar. (I want to impose the largest penalty for thinking that it's right when it's not.)

What do you think? @IsaacKing what about you? Is this a good test of what this market is about? Might be fun (but admittedly a lot of work for all who participate).

predicted YES

@Mira Can GPT-4 draw a circle?

predicted YES

@ScottLawrence I think something like that would be reasonable, though it also hinges a lot on the human's knowledge.

predicted NO

@IsaacKing See pages 16-19 of "[2303.12712] Sparks of Artificial General Intelligence: Early experiments with GPT-4" (arxiv.org) for examples of it drawing 2D and 3D scenes, including circles.

It can't directly generate images (that's GPT-5), but it can write code that generates the images. And if you tell it, "Change this, make its arms bigger, change the trajectory of the dragon, etc.", it can adjust the code in an interactive loop with you, despite never seeing the image or scene.
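As a minimal illustration of the kind of image-generating code in question (written here as an example, not actual GPT-4 output), a matplotlib sketch:

```python
# Draw a circle by emitting plotting code rather than pixels; the model would
# only ever see and edit this text, never the rendered image.
import matplotlib.pyplot as plt
import numpy as np

theta = np.linspace(0, 2 * np.pi, 200)
plt.plot(np.cos(theta), np.sin(theta))
plt.gca().set_aspect("equal")
plt.savefig("circle.png")
```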

@ScottLawrence If I were turning that into a benchmark, I'd consider using a separate test to cluster the humans; and then within a cluster of similar performers you could measure the distribution of improvements in bits. ("Improvement" can be negative or positive)

Which is more work than I want to put in, but I don't mind running a single test through GPT-4.