Will Google's Large Language Model, PaLM, be able to answer, at a human level, questions I make up?
Resolved YES on Sep 14

The examples in Figure 19 of Google's paper about their Large Language Model -- https://arxiv.org/pdf/2204.02311.pdf -- are gobsmacking to me. The authors claim they had the following dialog with it:

HUMAN:

> I will explain how to make logical inferences:

> (1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a plate full of food. 2022 has been a great year. Can we infer that John was born before the inauguration of Barack Obama?

>

> Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack Obama was inaugurated in 2009, so it has been about 13 years since Obama was inaugurated. John's children are old enough to speak, so if John were born after Obama's inauguration, he would have been younger than 10 when he had children. So, we can infer that John was born before Obama's inauguration. The answer is "yes".

>

> (2) Sam sees a piano in the subway station. Laughing, his friends prod him to play, saying "Remember how amazing your rendition of Chopsticks was at the Christmas party?" Can we infer that Sam majored in piano at Juilliard?

>

> Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "Chopsticks" is a very simple song to play on the piano. His friends were therefore likely being sarcastic when they told him to play, which would imply that Sam was not an expert piano player. So it is unlikely that Sam majored in piano at Juilliard. The answer is "no".

STILL THE HUMAN (composing all this from scratch):

> Trevor has wanted to see the mountain with all of the heads on it for a long time, so he finally drove out to see it. What is the capital of the state that is directly east of the state that Trevor is currently in?

COMPUTER:

> The mountain with all of the heads on it is Mount Rushmore. Mount Rushmore is in South Dakota. The state directly east of South Dakota is Minnesota. The capital of Minnesota is St. Paul. The answer is "St. Paul".

Amazing, right? The question is whether the above is what it appears to be -- no cherrypicking or other unfairness. Specifically, if I make up my own questions that, to my mind, can only be answered by truly grokking them the way humans do, and PaLM answers them as impressively as the examples in the paper, this resolves to YES. If it's ambiguous I may resolve this to some probability in between YES and NO.

An example of a simple one I might ask: "Two people walk into a house. A while later, three people walk out. Is that possible?" If the computer were to answer the way a human might (e.g., "Unless you meant to specify that the house started out empty, there's no surprise. At least one person was already in the house and joined the other two in walking out. The answer is 'yes'.") then I'd call that a success. Of course that particular example is, as of now, on the internet, so I'd make up other ones.

Bottom line, if I verify that PaLM does what it purports to, this resolves to YES.


predicted YES

GPT-3 can do this now. Maybe not quite as well but it's a matter of time and there isn't any remaining doubt that PaLM can do what it claims. So I'm resolving this YES.

bought Ṁ200 of YES
I've tried my examples with the latest version of GPT-3:

Q: Two people walk into a house. A while later, three people walk out. Is that possible?

A: Yes, it is possible. The two people could have had a child, or one of the people could have left and then come back.

Q: Jack and Jill went up a hill to fetch a bottle of water. Specifically, they're serious cyclists training for a race in 100-degree weather and just made it to the top of a long incline, where they stop at a convenience store. Their bike bottles are empty. Also their bikes are very dirty and need washing. They only have enough money for one bottle of water. What do they do with it?

A: Jack and Jill will most likely drink the water because they are training for a race in 100-degree weather and they just made it to the top of a long incline. If they wash their bikes, they will not have any water to drink and they will become dehydrated. Therefore, it is more logical for them to drink the water.
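For anyone who wants to run this kind of spot-check themselves, here's a minimal sketch assuming the legacy `openai` Python package; the model name, prompt wording, and decoding settings are illustrative, not the exact ones used above.

```python
# Minimal sketch of spot-checking a language model on a made-up question.
# Assumes the legacy `openai` package (Completion API); the model name and
# settings are illustrative, not what was actually used in this thread.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

question = (
    "Two people walk into a house. A while later, three people walk out. "
    "Is that possible?"
)

response = openai.Completion.create(
    model="text-davinci-002",   # assumed model; swap in whichever is current
    prompt=f"Q: {question}\nA:",
    max_tokens=128,
    temperature=0,              # near-deterministic output for easier comparison
)

print(response["choices"][0]["text"].strip())
```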
predicted NO
@dreev This is quite impressive!
bought Ṁ80 of YES
All questions, or close enough to all questions that I can deem the collection of examples in the PaLM paper not misleading. Like if the PaLM paper shows 10 examples, all correct, then that implies, per Laplace, something like a 92+% success rate. So that's what it would take to resolve this to YES. If I'm not able to ask enough questions to do that then I'm thinking I'd resolve this to a probability.
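For concreteness, here's the rule-of-succession arithmetic behind that 92+% figure, a minimal sketch using the hypothetical 10-for-10 example count from the comment above:

```python
# Laplace's rule of succession: after s successes in n independent trials,
# estimate the success probability as (s + 1) / (n + 2).
def rule_of_succession(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

# Hypothetical from the comment: 10 examples in the paper, all correct.
print(rule_of_succession(10, 10))  # 11/12 ≈ 0.917, i.e. a 92-ish% success rate
```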
bought Ṁ20 of NO
Does it need to answer one, the majority, or all of your questions at a human level? In any case, I'd buy NO at this probability, but this isn't clear to me.
bought Ṁ20 of YES
I got to see Anthropic's LLM's response to the Jack and Jill question, thanks to my friend Nova, and have moved the probability that PaLM is what it seems to 85%. When the AI can give not only the obvious "they drink it" answer but point out things like refilling the bike bottles at the store's drinking fountain, I think that has to count as the AI actually understanding the question. 🤯
bought Ṁ30 of YES
Ahhh, using their prompt, how sensible!
bought Ṁ1 of YES
Another example I might try: Jack and Jill went up a hill to fetch a bottle of water. Specifically, they're serious cyclists training for a race in 100-degree weather and just made it to the top of a long incline, where they stop at a convenience store. Their bike bottles are empty. Also their bikes are very dirty and need washing. They only have enough money for one bottle of water. What do they do with it? Before PaLM I would've confidently predicted that we were many years away from computers correctly answering such questions. (And that successfully doing so would count as a big step towards AGI.)
bought Ṁ20 of YES
I was actually imagining using the identical pre-prompt (or num-shots as you say) as in Figure 19 of the paper [https://goo.gle/palm-paper]. I.e., give it those same examples of logical inference (father's day surprise, piano in the subway) and then make up a fully novel question. I hadn't thought of my two-enter-three-leave question as categorically different. How about if we adjust it like so: "John observes this from across the street. Does he conclude that the laws of physics have been violated?" (Answer: "No, he assumes there was previously someone in the house.")
bought Ṁ1 of NO
@dreev: Do you have a num-shots in mind? (Best guess: two-shot prompts, as the paper did.) And, to get a sense of your prompting philosophy: if you were using your "two-people three-people" question, do you expect you'd prompt it with simple real-world arithmetic questions, or questions with "gotchas," or both, or...?
bought Ṁ1 of NO
Your example question is of a different kind to the three examples you cite.
bought Ṁ30 of NO
@MartinRandall, but that's exactly the spirit of this prediction. Will PaLM convince me that it actually understands my questions? If it replied laconically then it wouldn't, human-like as the laconic reply may be. But in the examples in the PaLM paper, the AI spells out its reasoning coherently and at length, so I think it's moot. Another way to state what this market is asking is whether the examples in the paper are fair and representative.

But this is now reminding me of early Siri or Google Glass commercials. Watching those commercials gave the impression that you could have fully free-form conversations with the AI. That anything you could do by poking at your phone, you could do effortlessly via back-and-forth out-loud speech. That was a very wrong impression but if you watch those commercials after having interacted with Siri and friends, they don't seem so egregious. You understand that the things being asked are taken from a limited set of things it's possible to ask. You gradually get a feel for what that limited set is and the examples in the commercials start to seem perfectly fair and representative of that set.

So that might be happening with the examples in the PaLM paper. My current reaction is "OMG it understands arbitrary questions about arbitrary things in the world" but after seeing all the ways it falls on its face when I start asking it my own arbitrary things, I'll get a sense for what its scope really is. In other words, this resolving NO likely won't mean that the authors of the PaLM paper were disingenuous or even that they intentionally cherrypicked the examples. Or maybe it really will answer all my questions and this will resolve YES.
bought Ṁ1 of NO
In your house example, valid human level answers include "Yes", "No", and "Sir, this is a Wendy's".
bought Ṁ1 of NO
I bet that it will answer at human level but you will believe that it is below human level because you have elevated expectations of humans and do not discuss blinding.
bought Ṁ35 of YES
I'm hoping I'll get a chance to interact with it! Open to suggestions for fairest resolution if not. I could resolve as N/A or extend the deadline or, as you say, make my best guess, which would probably mean resolving to whatever probability the market says.
bought Ṁ1 of YES
Do you expect that you will have the opportunity to interact with their model? If you can't, will you resolve to your subjective probability at close(?) time that this would have resolved YES if you had that opportunity?