Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions, ones an average human teenager could certainly answer, that a leading chatbot still gets wrong at least half the time when asked.
This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.
Here he claims that solving the Tower of Hanoi with 8 discs is something a bright and patient 7-year-old can do: https://open.substack.com/pub/garymarcus/p/a-knockout-blow-for-llms
This feels a bit more "useful" than the trick questions in SimpleBench. That said, I'm not sure the average human teenager could solve Hanoi.
Big difference here between "could" and "would". Out of curiosity I tried doing one with 8 disks to see if I ever made a mistake; I didn't, but it was quite tedious. As models improve they'll surely beat 8 disks and might only start failing at 10 or 12, and if you ask the average teenager to sit through that, it's not happening. Maybe if you offered to pay them for every 100 correct moves and let them take breaks in between.
@IsaacKing I think if a model can solve the 8-disk Tower of Hanoi it can also solve 10 or 12, because the solution is recursive. Similarly, somebody who knows how to solve the 8-disk version could also solve the 10- or 12-disk one. It's also easy to prove that the minimum number of moves is 2^n - 1, where n is the number of disks, so the move count grows exponentially. I agree with you that there's a big difference between "could" and "would", and the average teenager will not sit through a Tower of Hanoi puzzle with many disks, but I disagree that somebody (or something) that knows how to solve the 8-disk version will fail on the 10- or 12-disk version.
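(To make the recursion point concrete, here's the textbook solver sketched in Python; the names are just illustrative. The exact same function that handles 8 disks handles 12, only with exponentially more moves:)

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Standard recursive Tower of Hanoi solver; returns the full move list."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk to the target
        hanoi(n - 1, aux, src, dst, moves)  # restack the n-1 disks on top of it
    return moves

assert len(hanoi(8)) == 2**8 - 1    # 255 moves
assert len(hanoi(12)) == 2**12 - 1  # 4095 moves: same procedure, just longer
```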
Claude 4 Opus gets 58% on SimpleBench, a ways off from saturation, but not two years off (and the hardest SimpleBench questions don't appear to be "extremely obvious"). Giving LLMs code execution solves the strawberry shenanigans and the 9.11 stuff. If we're talking about text-only queries, what are the remaining classes of "egregious errors" that LLMs continue to make?
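(For the record, both of those famous gotchas reduce to one-liners once the model can call a code tool; a quick Python sanity check, purely illustrative:)

```python
print("strawberry".count("r"))  # 3 -- the letter-counting gotcha
print(9.11 < 9.9)               # True -- numeric comparison, not version numbers
```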
@TiredCliche I'm not sure the "average teenager" can play legal chess moves from an ASCII-art chat window. It's not blindfold chess, but it's also not the same as a physical chess set or a digital chess game. And the average teenager hasn't played chess in over a year.
@MartinRandall I just don't think the average teenager, playing from an ASCII-art chat window and given a reference for the rules of the game, would repeatedly try to invent new pieces.
But I don't think that matters a ton; I'm not under the impression that LLMs can play legally even given image data. I suspect they might actually get more confused.
@JoshYou AI Explained says he thinks SimpleBench won't last more than "3-12 months maybe?"
At 7:15 in this video: https://youtu.be/jWsd2fRzpUo
Reasoning models seem to address a lot of these. I don't see o3 failing on his recent gotchas. He could come up with new ones, but they're already pushing up against the limits of a normal teenager.
Plus, we're 3 years from this resolving and only 2.5 years out from the release of ChatGPT.
What type of LLMs, @ScottAlexander?
Transformer-based? SSMs? MoEs?
What if transformer-based LLMs are no longer SOTA by then? (See the linked market "Is attention all you need? (transformers SOTA in 2027)", currently at 61%.)
Is this architecture-invariant?
Would a black-box system qualify if it's known that one of its components exists to filter for inputs that might trip the LLM up?
What would happen if the prompt that Gary Marcus passes to the LLM never reaches the LLM unmodified?
I.e., it gets rewritten on the way from his user input (the way DALL-E 3 or Claude Opus rewrite prompts).
I think Scott is reasonably excluding token-parsing errors, which are orthogonal to LLM reasoning capability. They're a quirk of tokenization, i.e. of how text is chopped into tokens before the model ever sees it, and not a high-priority thing for OpenAI to fix.
Perhaps the unreasonable part is that he didn't explain his thought process. But people get busy.
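(To make the quirk concrete, here's a minimal sketch using OpenAI's tiktoken library, assuming the GPT-2/GPT-3-era r50k_base encoding, which is where the glitch token lived:)

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-2/GPT-3-era BPE
tokens = enc.encode(" SolidGoldMagikarp")
print(tokens)                             # a single token id
print([enc.decode([t]) for t in tokens])  # [' SolidGoldMagikarp']
# The whole string is one atomic token, so the model never sees its
# individual letters: asking for "the last letter" probes the tokenizer,
# not the model's reasoning.
```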
This market and its friends would probably be better off as polls, given the legion of ambiguities.
I'm about 99% sure that this market and others of this ilk will resolve based on how folks are vibing at the time.
I.e., don't take them too seriously.
If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)
It doesn't seem like we're getting clarification on this, so I've made a duplicate of this market that removes the "bizarre hacking-like tricks" exception.
@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking-like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails on, how is that counted?
"What is the last letter of 'SolidGoldMagikarp'?" is a pretty straightforward question for a human, so it seems weird to artificially exclude it, and I don't know how to predict what else is likely to be excluded.