Will Google's Large Language Model, PaLM, be able to answer, at a human level, questions I make up?
Resolved YES on Sep 14

The examples in Figure 19 of Google's paper about their Large Language Model -- https://arxiv.org/pdf/2204.02311.pdf -- are gobsmacking to me. The authors claim they had the following dialog with it:

HUMAN:

> I will explain how to make logical inferences:

> (1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a plate full of food. 2022 has been a great year. Can we infer that John was born before the inauguration of Barack Obama?

>

> Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack Obama was inaugurated in 2009, so it has been about 13 years since Obama was inaugurated. John's children are old enough to speak, so if John were born after Obama's inauguration, he would have been younger than 10 when he had children. So, we can infer that John was born before Obama's inauguration. The answer is "yes".

>

> (2) Sam sees a piano in the subway station. Laughing, his friends prod him to play, saying "Remember how amazing your rendition of Chopsticks was at the Christmas party?" Can we infer that Sam majored in piano at Juilliard?

>

> Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "Chopsticks" is a very simple song to play on the piano. His friends were therefore likely being sarcastic when they told him to play, which would imply that Sam was not an expert piano player. So it is unlikely that Sam majored in piano at Juilliard. The answer is "no".

STILL THE HUMAN (composing all this from scratch):

> Trevor has wanted to see the mountain with all of the heads on it for a long time, so he finally drove out to see it. What is the capital of the state that is directly east of the state that Trevor is currently in?

COMPUTER:

> The mountain with all of the heads on it is Mount Rushmore. Mount Rushmore is in South Dakota. The state directly east of South Dakota is Minnesota. The capital of Minnesota is St. Paul. The answer is "St. Paul".

Amazing, right? The question is whether the above is what it appears to be -- no cherrypicking or other unfairness. Specifically, if I make up my own questions that, to my mind, can only be answered by truly grokking them the way humans do, and PaLM answers them as impressively as the examples in the paper, this resolves to YES. If it's ambiguous I may resolve this to some probability in between YES and NO.

An example of a simple one I might ask: "Two people walk into a house. A while later, three people walk out. Is that possible?" If the computer were to answer the way a human might (e.g., "Unless you meant to specify that the house started out empty, there's no surprise. At least one person was already in the house and joined the other two in walking out. The answer is 'yes'.") then I'd call that a success. Of course that particular example is, as of now, on the internet, so I'd make up other ones.

Bottom line, if I verify that PaLM does what it purports to, this resolves to YES.


predicted YES

GPT-3 can do this now. Maybe not quite as well but it's a matter of time and there isn't any remaining doubt that PaLM can do what it claims. So I'm resolving this YES.

bought Ṁ200 of YES
I've tried my examples with the latest version of GPT-3:

Q: Two people walk into a house. A while later, three people walk out. Is that possible?

A: Yes, it is possible. The two people could have had a child, or one of the people could have left and then come back.

Q: Jack and Jill went up a hill to fetch a bottle of water. Specifically, they're serious cyclists training for a race in 100-degree weather and just made it to the top of a long incline, where they stop at a convenience store. Their bike bottles are empty. Also their bikes are very dirty and need washing. They only have enough money for one bottle of water. What do they do with it?

A: Jack and Jill will most likely drink the water because they are training for a race in 100-degree weather and they just made it to the top of a long incline. If they wash their bikes, they will not have any water to drink and they will become dehydrated. Therefore, it is more logical for them to drink the water.
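For anyone who wants to run this kind of spot-check themselves, here's a minimal sketch assuming the legacy `openai` Python package; the model name, prompt wording, and decoding settings are illustrative, not the exact ones used above.

```python
# Minimal sketch of spot-checking a language model on a made-up question.
# Assumes the legacy `openai` package (Completion API); the model name and
# settings are illustrative, not what was actually used in this thread.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

question = (
    "Two people walk into a house. A while later, three people walk out. "
    "Is that possible?"
)

response = openai.Completion.create(
    model="text-davinci-002",   # assumed model; swap in whichever is current
    prompt=f"Q: {question}\nA:",
    max_tokens=128,
    temperature=0,              # near-deterministic output for easier comparison
)

print(response["choices"][0]["text"].strip())
```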
predicted NO
@dreev This is quite impressive!
bought Ṁ80 of YES
All questions, or close enough to all questions that I can deem the collection of examples in the PaLM paper not misleading. Like if the PaLM paper shows 10 examples, all correct, then that implies, per Laplace, something like a 92+% success rate. So that's what it would take to resolve this to YES. If I'm not able to ask enough questions to do that then I'm thinking I'd resolve this to a probability.
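For concreteness, here's the rule-of-succession arithmetic behind that 92+% figure, a minimal sketch using the hypothetical 10-for-10 example count from the comment above:

```python
# Laplace's rule of succession: after s successes in n independent trials,
# estimate the success probability as (s + 1) / (n + 2).
def rule_of_succession(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

# Hypothetical from the comment: 10 examples in the paper, all correct.
print(rule_of_succession(10, 10))  # 11/12 ≈ 0.917, i.e. a 92-ish% success rate
```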
bought Ṁ20 of NO
Does it need to answer one, the majority, or all of your questions at a human level? In any case, I'd buy NO at this probability, but this isn't clear to me.
bought Ṁ20 of YES
I got to see Anthropic's LLM's response to the Jack and Jill question, thanks to my friend Nova, and have moved the probability that PaLM is what it seems to 85%. When the AI can give not only the obvious "they drink it" answer but point out things like refilling the bike bottles at the store's drinking fountain, I think that has to count as the AI actually understanding the question. 🤯
bought Ṁ30 of YES
Ahhh, using their prompt, how sensible!
bought Ṁ1 of YES
Another example I might try: Jack and Jill went up a hill to fetch a bottle of water. Specifically, they're serious cyclists training for a race in 100-degree weather and just made it to the top of a long incline, where they stop at a convenience store. Their bike bottles are empty. Also their bikes are very dirty and need washing. They only have enough money for one bottle of water. What do they do with it? Before PaLM I would've confidently predicted that we were many years away from computers correctly answering such questions. (And that successfully doing so would count as a big step towards AGI.)
bought Ṁ20 of YES
I was actually imagining using the identical pre-prompt (or num-shots as you say) as in Figure 19 of the paper [https://goo.gle/palm-paper]. I.e., give it those same examples of logical inference (father's day surprise, piano in the subway) and then make up a fully novel question. I hadn't thought of my two-enter-three-leave question as categorically different. How about if we adjust it like so: "John observes this from across the street. Does he conclude that the laws of physics have been violated?" (Answer: "No, he assumes there was previously someone in the house.")
bought Ṁ1 of NO
@dreev: Do you have a num-shots in mind? (Best guess: two-shot prompts, as the paper did.) And, to get a sense of your prompting philosophy: if you were using your "two-people three-people" question, do you expect you'd prompt it with simple real-world arithmetic questions, or questions with "gotchas," or both, or...?
bought Ṁ1 of NO
Your example question is of a different kind to the three examples you cite.
bought Ṁ30 of NO
@MartinRandall, but that's exactly the spirit of this prediction. Will PaLM convince me that it actually understands my questions? If it replied laconically then it wouldn't, human-like as the laconic reply may be. But in the examples in the PaLM paper, the AI spells out its reasoning coherently and at length, so I think it's moot. Another way to state what this market is asking is whether the examples in the paper are fair and representative.

But this is now reminding me of early Siri or Google Glass commercials. Watching those commercials gave the impression that you could have fully free-form conversations with the AI. That anything you could do by poking at your phone, you could do effortlessly via back-and-forth out-loud speech. That was a very wrong impression but if you watch those commercials after having interacted with Siri and friends, they don't seem so egregious. You understand that the things being asked are taken from a limited set of things it's possible to ask. You gradually get a feel for what that limited set is and the examples in the commercials start to seem perfectly fair and representative of that set.

So that might be happening with the examples in the PaLM paper. My current reaction is "OMG it understands arbitrary questions about arbitrary things in the world" but after seeing all the ways it falls on its face when I start asking it my own arbitrary things, I'll get a sense for what its scope really is. In other words, this resolving NO likely won't mean that the authors of the PaLM paper were disingenuous or even that they intentionally cherrypicked the examples. Or maybe it really will answer all my questions and this will resolve YES.
bought Ṁ1 of NO
In your house example, valid human level answers include "Yes", "No", and "Sir, this is a Wendy's".
bought Ṁ1 of NO
I bet that it will answer at human level but you will believe that it is below human level because you have elevated expectations of humans and do not discuss blinding.
bought Ṁ35 of YES
I'm hoping I'll get a chance to interact with it! Open to suggestions for fairest resolution if not. I could resolve as N/A or extend the deadline or, as you say, make my best guess, which would probably mean resolving to whatever probability the market says.
bought Ṁ1 of YES
Do you expect that you will have the opportunity to interact with their model? If you can't, will you resolve to your subjective probability at close(?) time that this would have resolved YES if you had that opportunity?