In 2028, will Gary Marcus still be able to get LLMs to make egregious errors?

Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions, that an average human teenager could certainly answer, which a leading chatbot still fails at at least half the time when asked.

This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.

  • What type of LLMs, @ScottAlexander ?

    • Transformer based? SSMs? MOEs?

    • Would a black-box system qualify if one of its known components is a filter for inputs that might trip the LLM up?

  • What would happen if the prompt that Gary Marcus passes to the LLM does not reach the LLM?

    • i.e. it is modified on the way from his user input (such as how DALLE-3 or Claude Opus write prompts)

i think scott is reasonably excluding token parsing errors which are orthogonal to llm reasoning capability. it's a quirk of conversion to embeddings and not a high priority one for openai to fix.

perhaps the unreasonable part is where he didn't explain his thought process. but people get busy

this market and friends would probably be better off as a poll due to the legion of ambiguities.

I'm about 99% sure that this market and others of this ilk will resolve based on how folks are vibing at the time.

i.e. don't take them too seriously.

If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)

Doesn't seem we're getting clarification on this, so I've made a duplicate of this market that removes the "bizarre hacking like tricks" exception.

@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?

"What is the last letter of 'solidGoldMagickarp'?" is a pretty straightforward question for a human, so it seems weird to be artificially excluding it, and I don't know how to predict what else is likely to be excluded.

In 2028, will LLMs be able to get Gary Marcus to make egregious errors?

@YuxiLiu mildly wanting to make an actual question on this; the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.

What counts as "bizarre hacking like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?

"What is the last letter of 'solidGoldMagickarp'?" is a pretty straightforward question for a human, so it seems weird to be artificially excluding it, and I don't know what else is likely to be excluded.

@IsaacKing I think it would distract from the question of 'have they gotten significantly better at reasoning and common sense', and would be more 'do they have some very specific pathology that is not actually remotely relevant'.

Though I agree it could be pinned down more, I'm just uncertain what it should be pinned down to.

Title says 'LLM', but description says 'leading chatbot'. If chatbots use mixed models, or an entirely different kind of model than LLMs, by 2028, will those chatbots still count?

Gigacasting


@Gigacasting Trolling aside, expecting the LLM to solve a simple problem without spending much time on prompt engineering is a fair demand, and one that's likely to become much less relevant in five years.

@NcyRocks When we test human intelligence we put a lot of work into prompting correctly. Failure to do so often gives spurious results. Even going from the math room to the chess room or the poetry room is many more bits of prompt than an LLM needs to produce its best work.

If you think most middle schoolers can do that

You might have only been around certain groups and not others…

More random nonsense

