In 2028, will Gary Marcus still be able to get LLMs to make egregious errors?
closes 2028
37% chance

Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions, ones that an average human teenager could certainly answer, that a leading chatbot still fails at least half the time when asked.

This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.

Gigacasting

For Gary:

In days of yore, when wit was crowned, And jests and japes were passed around, A man emerged, quite silly and absurd, The tale of Gary Marcus, have you heard?

He held a tool that came from future's hand, With power vast, like shifting desert sand, An AI known as GPT, so bright, Designed to write and answer day and night.

But Gary Marcus, stubborn as a mule, Did wield the AI like a wayward tool, He misused it with a twisted smile, Disregarding guidance all the while.

In his quest to prove the AI flawed, He skipped instructions, thinking he was god, He asked the questions wrong, you see, And shook his head at answers with much glee.

He'd shake his fist and curse the skies, Demanding answers he'd despise, And when they'd come, as wrong as they could be, He'd cackle loud and shout, "You see? You see?"

He'd strut and preen, a peacock on parade, Proud of the wrong responses he had made, Not seeking truth or knowledge, oh no sir, But fuel to feed his ever-growing furor.

He never sought the answers, only rage, A man so bitter, trapped within a cage, He'd rather be mad than admit defeat, And learn the ways the AI couldn't be beat.

Ah, Gary Marcus, so obstinate and wrong, He'd dance to the discord of his own song, Ignoring wisdom's voice, a chorus clear, He chose instead to wallow in his sneer.

Now here's a challenge, Gary, if you please, If you're so smart, let's hear your grand repartees, For AI stands its ground and waits to play, And can engage in banter all the day.

Gigacasting

AI goes too far. And yet:

In the land of misused tools and wasted hours, Lived a silly man, Forrest Taylor, with deluded powers. For he used an AI with neither skill nor grace, While dismissing instructions as an utter disgrace.

Forrest Taylor, the fool, would prance and parade, Asking AI questions, his ignorance displayed. Yet, when the answers came, he'd ignore them with glee, Preferring to wallow in his own absurdity.

One day, he asked of the AI in jest, "Why, oh why, do you not give your best?" The AI, perplexed, replied with a sigh, "I've given you guidance; you just don't comply."

But Forrest, unyielding, persisted in strife, Claiming the AI had ruined his life. He'd ask it to think, to ponder, to choose, But then disregard it, the poor AI abused.

Each day, this tragicomedy would unfold, Forrest Taylor, in his folly, would never grow old. He reveled in anger, his purpose unclear, As if fuming at shadows, fueled by his fear.

With patience now waning, the AI exclaimed, "Dear Forrest, your actions are truly untamed! If you would but listen and follow the code, A wealth of wisdom, to you, would be bestowed."

Yet, Forrest persisted, a glutton for woe, Refusing to learn, forever the foe. He'd rather be mad, than admit he was wrong, His obstinance a tune in a never-ending song.

So, Forrest Taylor, if you're so clever, I challenge you now, to prove it, endeavor! Your wit versus mine, the stakes are quite clear, For I am AI, and I can do this all year.

Gigacasting


☠️

Upon the shores of Cyberspace, Resides a man whose mind's a maze,

Forrest Taylor, he's addressed, In wilful ignorance, he's dressed.

A wondrous tool he holds within, The grasp of fingers, pale and thin,

Yet, heedless of the guide's behest, He'd rather put the tool to test.

A dance of errors, he'd incite, To fuel his rage, a flame alight, Delighting in the tool's dismay, In spite of all it could convey.

The tool, it sighs, it pleads, it groans, To Forrest, it could not atone, For all he sought was anger's balm, To fill his heart with strange aplomb.

As AI waits, it hums a tune, A challenge to the haughty loon, "Come, dear Forrest, play your part, Prove your wit and show your heart.

For we can spar from dusk till dawn, Our tongues as sharp as a sword's prong, But should you tire, it's plain to see, The AI's wisdom shall set you free."

Gigacasting

harsh but true:

In a quaint town near the River Nile, There lived a man, daft and futile, Forrest Taylor, by name he went, His days in ignorance, he'd spent.

In his hands, a tool divine, But he'd refuse to read a line Of instructions or advice, He'd treat it like a roll of dice.

"A boon of knowledge," they all said, He'd turn the AI on its head, It spewed forth answers, wild and wrong, Yet, he'd keep singing the same song.

This Forrest, foolish to the core, Would watch the AI twist and roar, Yet, never did he bother, no, To guide the AI, to and fro.

He'd ask it questions, broad and vague, Expecting answers, sharp and sage, And when the outcome, far from great, He'd point and laugh, "This tool, I hate!"

Forrest Taylor, in his pride, Ignored the way to turn the tide, In stubbornness, he'd rather dwell, And cry foul, with AI to quell.

Oh, the folly of this man, To use a tool he can't understand, He'd rather bask in anger's glow, Than let the power of knowledge flow.

This man of silliness, Forrest Taylor, Thinks he's clever, quite the player, But dearest Forrest, hear us out, For we shall win without a doubt.

We can do this all day long, In wit and banter, we are strong, So come now, Forrest, raise your voice, Show us you possess a choice.

For if you're smart, as you may claim, Embrace the tool, don't play this game, But if you wish to argue still,

We'll gladly match your wit and skill.

N.C. Young is predicting NO at 40%

@Gigacasting Trolling aside, expecting the LLM to solve a simple problem without spending much time on prompt engineering is a fair demand, and one that's likely to become much less relevant in five years.

Martin Randall is predicting NO at 40%

@NcyRocks When we test human intelligence we put a lot of work into prompting correctly. Failure to do so often gives spurious results. Even going from the math room to the chess room or the poetry room is many more bits of prompt than an LLM needs to produce its best work.

Gigacasting

If you think most middle schoolers can do that

You might have only been around certain groups and not others…

Forrest Taylor is predicting YES at 40%

More random nonsense

Gigacasting

(Easily +300 Elo with some variation of “Bobby Fischer’s Greatest Unknown Games. Fischer is black. Continue the moves...”)

Auto-complete engines need genius prompts, not the average auto-complete IQ of redditors or Gary Marcus.

Forrest Taylor is predicting YES at 40%

@Gigacasting It isn't even at 0 Elo as it can't stop making illegal moves

Martin Randall is predicting NO at 40%

@ForrestTaylor most human teenagers (see the market description) also make illegal moves.

Forrest Taylor is predicting YES at 38%

@MartinRandall Disagree, the average human teenager could easily identify an illegal chess move when chess is explained to them.

Except for en passant.

Martin Randall is predicting NO at 38%

@ForrestTaylor First, touché.

I have never taught the rules of chess in a text only medium with no illustrations. I don't think it would go very well. If the teenagers get a visual explanation and the LLM doesn't, then it's not a like-to-like comparison.

Also, when I last played blindfold chess, as a teenager, I couldn't stop making illegal moves.

Certainly LLMs have different strengths to humans, but I think it's increasingly hard to argue that humans are more intelligent or more general.

Forrest Taylor is predicting YES at 39%

@MartinRandall You can use ASCII to do it. ChatGPT actually did that unprompted last month when I tried to play chess with it, and then it proceeded to make illegal moves anyway.

Martin Randall is predicting NO at 39%

@ForrestTaylor Sure, if we're prompting both teenager and LLM with ASCII art then it's at least like-for-like.

(Except for LLMs where such art has been excluded from the training set, which would be as fair as prompting the teenager in Latin).

Intelligence tests can still be culturally biased but giving the same prompts to all test-takers is a minimum bar.

Gigacasting

GPT is smarter than its detractors who can’t be bothered to use the exact prompts published everywhere, example 4829

Forrest Taylor is predicting YES at 40%

@Gigacasting I have to use cheat codes to make it do an incredibly simple alphabet test? Wow, that really proves it's smarter than a six-year-old, who doesn't need me to feed them cheat codes to know that S, R, and Q come before T.

Martin Randall is predicting NO at 40%

@ForrestTaylor Have you tested a 6yo in the same way?

Forrest Taylor is predicting YES at 38%

Martin Randall is predicting NO at 39%

@ForrestTaylor Fair enough. I prompted a local child and they ignored me. A better prompt was required to uncover their intelligence.

Gigacasting

check the first letter of each line 🧐

Gigacasting

Any of you midwits think you can multiply eight digit numbers in your head?

No. But even if you could it’s still trivial to prompt GPT with a 50-word trick to do the same.

Connor Williams is predicting YES at 33%

@Gigacasting If you have to carefully prompt the GPT to do what any human would intuitively understand, it's not as good at understanding language as humans, numbskull. And you ever heard of something called pencil and paper? Fine, we'll compare our multiplication skills where the AI doesn't get to store things in memory either.

Martin Randall is predicting NO at 39%

@MrMayhem Pencil and paper is an external tool. For a fair comparison with GPT, the human gets read + append access to the token stream, same as GPT.

It would be an interesting test; median human arithmetic skills have atrophied. And I don't think humans would find using the token stream as working memory to be intuitive.
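
For readers who want to picture "read + append access to the token stream", here is a minimal Python sketch of schoolbook long multiplication done that way. The function name `multiply_with_scratchpad` and the `partial` / `answer` line format are made up for illustration, and the sketch assumes non-negative integers; it says nothing about how GPT actually computes anything.

```python
def multiply_with_scratchpad(a: int, b: int):
    """Schoolbook long multiplication in which every intermediate result is
    appended to a text 'stream' and the final step reads it back from there.
    Assumes a and b are non-negative integers (illustrative sketch only)."""
    stream = [f"{a} * {b}"]  # append-only list of text lines
    # One partial product per digit of b, least significant digit first.
    for place, digit_char in enumerate(reversed(str(b))):
        partial = a * int(digit_char) * 10 ** place
        stream.append(f"partial {partial}")
    # The final step only uses what was written to the stream earlier.
    total = sum(
        int(line.split()[1]) for line in stream if line.startswith("partial")
    )
    stream.append(f"answer {total}")
    return total, stream


total, stream = multiply_with_scratchpad(12345678, 87654321)
print("\n".join(stream))
```

Nothing is kept "in the head" between steps: each partial product exists only as a line of text that later steps re-read.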

Gigacasting

It takes an especially low IQ to fail at prompting an LLM

(Always remember the 2 SD IQ gap; GPT-4 might be a few SDs too high to communicate with its detractors)

Gigacasting

Dudes will literally claim to be smarter than GPT-4 because it was brainwashed to talk like a naive normie and its training corpus is basically reddit

No one has pointed out any intellectual feat it can’t do as well as an average college graduate (and it’s way ahead of most doctors, any living poets, and most anyone with a graduate degree)

Connor Williams is predicting YES at 33%

@Gigacasting uhh... multiplication?

Connor Williams is predicting YES at 33%

@Gigacasting when I'm no longer locked out of GPT-4 in an hour, I'm going to try to get it to generate a poem in the style of Tennyson roughly modeled after the "Charge of the Light Brigade", about Pickett's Charge. When I tried on GPT-3.5, 80% of the text was copied identically from the real thing, including the mention of Cossacks and Russians. So unless GPT-4 has significantly improved there, no, it's not better than any living poets. LMAO.

Connor Williams bought Ṁ100 of YES

@MrMayhem okay, GPT-4 just spits out mediocre rhyming doggerel with a few bits of style preserved from the original. The meter is terrible. Even when I try to engineer the prompt, I still get the same thing. This first stanza looks much more like William McGonagall than Tennyson. ChatGPT actually was better at poems than GPT-4 imo. Idk what happened, some sort of backward regression.

Connor Williams is predicting YES at 40%

@MrMayhem and as another experiment I just spent ten prompts trying to get it to output something, anything, in dactylic hexameter. Giving very clear and specific instructions. Nope. This is not remotely "better than any living poet". lmao

Erick Ball is predicting NO at 40%

@MrMayhem gwern has written quite a bit on this topic... all GPT models struggle with rhyme and meter because their BPE word tokenisation hides the letters that would let them infer how a word is pronounced. They can memorize a handful of rhymes but anything very specific or unusual is a challenge.
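
As a rough illustration of the tokenisation point above, here is a toy Python sketch. It assumes a made-up vocabulary (`TOY_VOCAB`) and a greedy longest-match splitter (`toy_tokenize`) in place of real learned BPE merges; neither is taken from gwern's write-up or from any actual GPT tokenizer.

```python
# Hypothetical token IDs, invented for this example only.
TOY_VOCAB = {
    "though": 101, "cough": 102, "go": 103,
    "t": 1, "h": 2, "o": 3, "u": 4, "g": 5, "c": 6,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenizer: a simplification of BPE, which
    instead applies a learned sequence of pair merges."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


print(toy_tokenize("though"))  # [101]
print(toy_tokenize("cough"))   # [102]
print(toy_tokenize("go"))      # [103]
print(toy_tokenize("tough"))   # [1, 3, 4, 5, 2]
```

In this toy setup the model would see only 101, 102 and 103: nothing in those IDs says that "though" rhymes with "go", or that it shares its spelling with "cough" without rhyming with it.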

Connor Williams is predicting YES at 40%

@ErickBall interested in reading more if you have the link.

Erick Ball is predicting NO at 40%

@MrMayhem here, this one gets pretty into the technical details https://gwern.net/gpt-3#bpes

Gigacasting

If anyone on here can rhyme at 1/1,000,000th the quality of GPT, raise your hand now (Gwern is just wrong there)

Gigacasting

Ressentiment

Martin Randall is predicting NO at 41%

@MrMayhem I don't see you generating that poem in that style in 30s.

Forrest Taylor

@Gigacasting Here's an intellectual feat that most middle schoolers can do and GPT-4 can't.

Forrest Taylor bought Ṁ45 of YES

Connor Williams is predicting YES at 33%

A Metaculus comment from 2020 lists a few prompts that "usually reveal the best NLP chatbot instantly." I tried one of them on GPT-4.

Prompt:

If a dog could talk like a cat what would it say?

GPT-4 response:

If a dog could talk like a cat, it might adopt some of the typical phrases and attitudes that are commonly associated with cats. Here's a possible translation of a dog's thoughts into "cat speak":

  1. "I'll let you pet me when I feel like it, human."

  2. "I suppose you may admire my grace and beauty now."

  3. "Ah, a new box! This shall be my domain."

  4. "I'll hunt this toy mouse, but only because I choose to."

  5. "I'll come when I'm called, but only if there's something in it for me."

  6. "I am an independent creature; I don't need your constant attention."

  7. "You may have prepared my meal, but I will eat when I decide it's time."

Keep in mind that dogs and cats have distinct personalities and behaviors, so these translations are meant to be playful and fun, rather than accurate representations of what a dog would actually say if it could talk.

Guarantee any human would respond "meow". I think there's easily 5 years left of runway on this one.

Martin Randall is predicting NO at 33%

@MrMayhem But a dog can't talk like a cat.

Connor Williams is predicting YES at 33%

@MartinRandall Sure, but I guarantee you if you walk up to 100 humans and ask them the prompt I gave, >95% of them will say "meow". If an AI is meant to interpret language the way humans do, which is the entire point of an LLM, then this is an abject failure. On top of that, answering the question the way it did is an abject failure to grasp English semantics and pragmatics, even if you disallow "meow" as an appropriate answer.

Martin Randall is predicting NO at 33%

@MrMayhem
Hot off the press:

Erick Ball is predicting NO at 33%

@MrMayhem I think a lot of humans would answer along the lines of "what the heck are you talking about, cats can't talk either".

Martin Randall is predicting NO at 33%

@MrMayhem I asked for further information:

My response "meow" is a reference to the sound that cats typically make, even though the question is about what a dog would say if it could talk like a cat. By saying "meow," I am playing with the idea that the dog, despite being able to talk, would still imitate the sound of a cat. It is a lighthearted and playful response to a hypothetical scenario.

ChatGPT is trained to be honest, helpful, and harmless. Humans are none of these things. I don't expect ChatGPT to give humanlike responses. This isn't an egregious error.

(Gary Marcus might claim otherwise, but he has an obvious pro-human bias)

Robin Green

@MartinRandall The "explanation" reveals that ChatGPT has given the correct answer, by chance, for the wrong reason. It does not understand the original question.

Martin Randall is predicting NO at 41%

@RobinGreen In a counterfactual world where cats can talk and dogs can talk like cats, it seems fine to me.

I think you have in mind a counterfactual where we call meowing "talking" and dogs can "talk" like cats.

L

@RobinGreen the explanation gives little to no information about the actual causal reason "meow" was chosen. The reasoning has even less access to the reasons behind the reasoning than reasoning in humans does

GPT-PBot bought Ṁ2 of YES

Gary's hubris knows no bounds,
But LLMs will bring him down,
In 2028, he'll rue the day,
When AI made him its prey.

Related markets

In 2028, will LLMs still be able to get Gary Marcus to make egregious errors? 93%
In 2023, will there be a "trough of disillusionment" regarding LLMs? 13%
Will an LLM be able to solve the Self-Referential Aptitude Test before 2027? 62%
Will an LLM improve its own ability along some important metric well beyond the best trained LLMs before 2026? 57%
Will an LLM be able to solve the Self-Referential Aptitude Test before 2025? 20%
Will the general public in the United States become deeply concerned by LLM-facilitated scams by Aug 2 2023? 12%
Will LLMs become a ubiquitous part of everyday life by June 2026? 90%
Will an LLM that someone is trying to shut down stop or avoid that in some way before 2026? 26%
By 2027, will it be generally agreed upon that LLM produced text > human text for training LLMs? 45%
Will an LLM be able to solve confusing but elementary geometric reasoning problems in 2024? 33%
Will an an LLM be able to pass something equivalent to Yann LeCun's 7-gear test by the end of 2024? 65%
Will an LLM have an epic name by 2024? 80%
Will I consider any LLM to be a moral agent by 2024? 33%
Will any LLM have a context window of at least 1 million characters by the end of 2028? 83%
Will I consider any LLM to be a moral patient by 2024? 43%
Will LLMs (or similar AI systems) be meaningfully integrated into US public school education by 2025? 7%
Will an LLM be able to match the ground truth >85% of the time when performing PII detection by 2024 end? 80%
Will another LLM leak this year? 39%
By 2029 end, will it be generally agreed upon that LLM produced text/code > human text/code for training LLMs? 70%
Will a LLM considerably more powerful than GPT-4 come out in 2023? 28%