In 2028, will Gary Marcus still be able to get LLMs to make egregious errors?
44% chance

Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions that an average human teenager could certainly answer, but which a leading chatbot still fails at least half the time when asked.

This won't resolve positively if he has to use bizarre hacking-like tricks, for example, anything equivalent to the SolidGoldMagikarp token.

12d

Here he claims that solving the Tower of Hanoi with 8 disks is something a bright and patient 7-year-old can do: https://open.substack.com/pub/garymarcus/p/a-knockout-blow-for-llms

This feels a bit more "useful" than the trick questions in SimpleBench. That said, I'm not sure the average human teenager could solve Hanoi.

12d

Big difference here between "could" and "would". Out of curiosity I tried doing one with 8 disks to see if I ever made a mistake, and I did not, but it was quite tedious. As models improve they'll surely beat 8 disks and might only start failing on 10 or 12, and ask the average teenager to sit through that and it's not happening. Maybe if you offered to pay them for every correct 100 moves, and let them take breaks in between.

4d

@IsaacKing I think if a model can solve the 8-disk Tower of Hanoi it can also solve 10 or 12, because the solution is recursive. Similarly, somebody who knows how to solve the 8-disk version could also solve the 10- or 12-disk version. It's also easy to prove that the number of moves required is 2^n - 1, where n is the number of disks, so the move count grows exponentially. I agree with you that "could" and "would" are very different, and the average teenager will not sit through a Tower of Hanoi puzzle with many disks, but I disagree that somebody/something that can solve the 8-disk version will fail on the 10- or 12-disk version.
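A minimal recursive sketch in Python (illustrative only, not from the paper being discussed): the same three-line recursion solves any n, and the move count satisfies T(n) = 2T(n-1) + 1 = 2^n - 1.

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Return the move list for an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)  # park n-1 disks on the spare peg
    moves.append((src, dst))            # move the largest disk to the goal
    hanoi(n - 1, aux, dst, src, moves)  # stack the n-1 disks back on top
    return moves

assert len(hanoi(8)) == 2**8 - 1    # 255 moves
assert len(hanoi(12)) == 2**12 - 1  # 4095 moves: same algorithm, just longer
```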

3d

@ZandaZhu You should read the paper this thread is discussing then.

21d

Claude 4 Opus gets 58% on SimpleBench, a ways off from saturation, but not two years off (and the hardest SB questions do not appear to be "extremely obvious"). Giving LLMs code execution solves the strawberry shenanigans and the 9.11 stuff. If we're talking about text-only queries, what are the remaining classes of "egregious errors" that LLMs continue to make?

@AdamK inventing new pieces during a game of chess.

20d

@TiredCliche I did that as a teenager playing blindfold chess.

@MartinRandall Perhaps, but it seems odd to call this blindfold chess.

20d

@TiredCliche I'm not sure if the "average teenager" can play legal chess moves from an ASCII art chat window. It's not blindfold but it's also not the same as a physical chess set or a digital chess game. And the average teenager hasn't played chess for over a year.

@MartinRandall I just don't think the average teenager from an ASCII art chat window, given reference to the rules of the game, would repeatedly try to invent new pieces.

But I don't think that matters a ton; I am not under the impression that LLMs can play legally even given image data. I suspect they might actually get more confused.

@MartinRandall telnet freechess.org 5000
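For what it's worth, checking whether a model's moves are even legal is mechanical. A minimal sketch using the python-chess library (the proposed-move string is hypothetical):

```python
import chess  # pip install python-chess

board = chess.Board()
proposed = "Nf3"  # hypothetical move string produced by a model
try:
    board.push_san(proposed)  # raises ValueError if illegal or unparseable here
    print("legal:", proposed)
except ValueError:
    print("illegal or unparseable:", proposed)
```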

bought Ṁ50 YES 17d

@AdamK LLMs can solve things like strawberry and 9.11 with code, but that doesn't mean they will do so if you ask the question without instructing them to use code. These sorts of mistakes still pop up sometimes and would count for this market.
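A minimal illustration in plain Python (not any particular chatbot's tool-calling API) of why delegating to code settles both questions:

```python
# Both "gotchas" are trivial once handed to an interpreter.
print("strawberry".count("r"))  # 3
print(9.11 > 9.9)               # False: 9.11 < 9.9 as numbers
```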

16d

@JoshYou AI Explained says he thinks SimpleBench won't last more than "3-12 months maybe?"

7:15 in this video: https://youtu.be/jWsd2fRzpUo

1mo

Reasoning models seem to address a lot of these. I don't see o3 failing on his recent gotchas. He could come up with new ones, but they're already pushing up against the limits of a normal teenager.

Plus, we're 3 years from this resolving, and it's been 2.5 years since the release of ChatGPT.

bought Ṁ7,000 NO at 39% 1mo

@Mactuary I'm generally betting on slower AGI timelines but from my own experience with o3, I agree. I think there's uncertainty on how this would resolve today, let alone in 2028.

22d

@FergusArgyll I'll buy some No on that

@Mactuary Read the comments there!

o3 (and all SOTA LLMs) are very impressive and useful, but still very easy to trip up.

1y

• What type of LLMs, @ScottAlexander? Transformer-based? SSMs? MoEs?

• Would a black-box system qualify, where one of the system's known components filters for things that might trip an LLM up?

• What happens if the prompt Gary Marcus passes never reaches the LLM as written, i.e. it is modified on the way from his user input (the way DALL-E 3 or Claude Opus rewrite prompts)?

I think Scott is reasonably excluding token-parsing errors, which are orthogonal to LLM reasoning capability. They're a quirk of the conversion to embeddings, and not a high-priority one for OpenAI to fix.

Perhaps the unreasonable part is that he didn't explain his thought process. But people get busy.

This market and its friends would probably be better off as polls, given the legion of ambiguities.

I'm about 99% sure that this market and others of its ilk will resolve based on how folks are vibing at the time.

I.e., don't take them too seriously.

If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)

predicted YES 1y

Doesn't seem we're getting clarification on this, so I've made a duplicate of this market that removes the "bizarre hacking-like tricks" exception.

predicted YES 1y

@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking-like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?

"What is the last letter of 'SolidGoldMagikarp'?" is a pretty straightforward question for a human, so it seems weird to artificially exclude it, and I don't know how to predict what else is likely to be excluded.

1y

In 2028, will LLMs be able to get Gary Marcus to make egregious errors?

predicted NO 1y

@YuxiLiu I'm mildly tempted to make an actual question on this; the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.

predicted NO 1y

Lol, two trades 10 seconds after my comment

predicted NO 1y

@colorednoise Maybe 3 comments in a row from people predicting No -> bot trade?
