Will there be any text-based task that most humans can solve, but top LLMs won't? By the end of 2024
94% chance

Currently it's possible to craft simple tasks that most humans are able to solve, but LLMs can't. This market predicts whether this will still hold true by the end of 2024.

Even when a clear explanation of the problem and of an algorithm to solve it is provided, today's LLMs haven't been shown to answer correctly in a reliable way, while arguably most humans would succeed if adequately instructed.

In other words, this market aims to compare the reasoning abilities of an average human with those of the top LLMs, at the end of 2024, in the fairest way I could think of.

Tasks

This market is about tasks meant to test reasoning abilities: tasks that can be solved using only pen and paper, that can be understood and learned in less than 15 minutes, and that you'd expect motivated kids to be able to solve.

The exact set of tasks that count for this market is, of course, open-ended and ambiguous in nature. In general, if you suspect that a majority of literate people aged between 12 and 70 are likely to be able to solve the task after training for one hour with an expert, then the task most likely counts toward this market. If you have specific tasks in mind, let's discuss them in a comment.

Examples of tasks are:

Tasks are not allowed if they:

  • Require extensive training beyond what's taught at primary school (e.g. "write a function in Python that ...").

  • Rely on specific knowledge (e.g. "what's today's date?").

  • Rely on specific human senses/features that may be unavailable to some LLMs (e.g. "which of these two stones feels warmer to the touch?" etc).

The goal is to compare reasoning abilities.

Rules

Participants (both humans and LLMs) shouldn't need to know the task beforehand. They get some limited training to understand the task and a resolution strategy, and they are not allowed to use any tools besides their own cognition and a scratchpad.

Humans have one hour to learn the task and train for it, with the assistance of an expert. No other tool besides pen and paper can be used to solve the problems.

LLMs get instructed with the best prompt anyone can find to solve the task; the only limitation is the LLM's own context length. No external tools besides the LLM's core features can be used: a multimodal LLM with native image input can use that, but it can't use a code interpreter, access the internet, or use any tool to process images. The LLM can access the data it outputs while solving the problem.

The LLMs considered by this market need to be widely available, in the same spirit as markets like [When will Google's Gemini model be released?]: at least tens of thousands of users not affiliated with any given organization need to have access to them.

Resolution

This market resolves YES if by the end of 2024, we know at least one task that most humans can solve, but no LLM can.

This market resolves NO when it becomes clear that at least one LLM released within 2024 is able to solve every task that most humans can solve.

Related markets:

Given the overwhelming response to this market, I decided to try again with later dates:


I don't see how any LLM can solve any task that requires induction. In other words: you can always come up with a multiplication which is small enough for a human to solve with pen and paper, but too big for an LLM to learn without doing the actual calculation (which is not allowed in this question).

predicts NO

too big for an LLM to learn without doing the actual calculation (which is not allowed in this question).

Actually, this market explicitly allows doing the calculation. In the prompt for the LLM you can describe an algorithm to compute the calculation and ask the LLM to perform it. By doing this, GPT-4 is currently able to e.g. multiply 6-digit numbers (see: https://www.lesswrong.com/posts/XvorpDSu3dwjdyT4f/gpt-4-multiplication-competition).
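For a concrete picture, the setup might look something like this (a rough sketch assuming the openai-python client; the prompt wording and the numbers are mine, not the actual competition prompt):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An "algorithm in the prompt": spell out schoolbook long multiplication
# and ask the model to execute it step by step.
ALGORITHM = """Multiply large numbers with the schoolbook method:
1. Split the second factor into its decimal digits.
2. Multiply the first factor by each digit, writing every carry explicitly.
3. Shift each partial product by its digit's place value and sum column by column.
Show all intermediate work before stating the final answer."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": ALGORITHM},
        {"role": "user", "content": "Compute 317482 * 694205."},
    ],
)
print(response.choices[0].message.content)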

The end goal here is to make the comparison "fair": the human and the LLM both get the same kind of tools; humans can have some training (e.g. one hour of private training with an expert), while the LLM can have a prompt as good as anyone can craft.

I was hoping the market made it clear enough that this is allowed, by stating:

Even when a clear explanation of the problem and of an algorithm to solve it is provided, today's LLMs haven't been shown to answer correctly in a reliable way, while arguably most humans would succeed if adequately instructed.

But I'm realizing that it's not clear enough. I'll try to improve the description later.

bought Ṁ9 of YES

Lots of grey area to figure out for the resolution I think, but obviously leaning yes 😉
Some questions worth exploring...

  1. Could you specify the prompt a bit more? e.g. "any tasks we can reasonably expect >50% of >P75 educated children and adults from 12-50 in the US to complete from scratch with max 30 minutes of preparation/instructional time and 30 minutes of time spent on task, with no more than 2000 "words" of instructions and 1000 "words" of task content" -- doesn't need to be that specific, but even "easy Sudoku puzzles" are in question atm 🤣

  2. I assume for the LLM, you'd give some time for people to create a best effort reasonable prompt for any given task? Is there a cap to the prompt length or time taken to prep any prompt? e.g. Let's say it's Dec 31, and somebody submits a task that the LLM doesn't seem like it can do off the bat. What happens?

    1. Or is the rule: "The human gets the exact same prompt as the AI, and it's all text/image only"?

  3. Does the LLM need to achieve some level of accuracy (equal to/better than human, or a strict 5/10 attempts with any base settings configured)? The latter is much easier because the former requires sampling humans. e.g. If the LLM can solve freshly generated Easy Sudoku 50% of the time with a specific prompt, does that qualify for the task of "solve freshly generated Easy Sudoku"?

predicts NO

"any tasks we can reasonably expect >50% of >P75 educated children and adults from 12-50 in the US to complete from scratch with max 30 minutes of preparation/instructional time and 30 minutes of time spent on task, with no more than 2000 "words" of instructions and 1000 "words" of task content"

The test I envisioned is something like this: for a task, you pick one or more experts who are great at teaching it; every candidate (a person out of a statistical sample) who is literate and a native speaker of the same language as the experts gets trained privately by the experts for one hour; the candidate is then given five problems and has 15 minutes to solve each; if they solve at least three, the candidate wins $1,000. If at least half of the candidates solve at least three of their five problems, the experts win $10,000 each and the task counts for this market.
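The pass/fail bookkeeping of that test is simple enough to write down (a sketch of the thresholds just described; the sample data is made up):

def candidate_passes(solved: int, required: int = 3) -> bool:
    # A candidate wins their $1,000 by solving at least 3 of their 5 problems.
    return solved >= required

def task_counts(per_candidate_solved: list[int]) -> bool:
    # The task counts for the market if at least half of the candidates pass.
    passes = sum(candidate_passes(n) for n in per_candidate_solved)
    return passes >= len(per_candidate_solved) / 2

# Hypothetical sample of ten candidates: six pass, so the task would count.
print(task_counts([5, 3, 2, 4, 3, 1, 3, 0, 5, 2]))  # True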

However, both your test and mine have a problem: how can we check, in practical terms, whether any given task satisfies them? I don't expect anyone to run any statistical study on this.

So, in order to resolve this market, I think it's easier to describe the spirit of the tasks and provide a number of examples. I believe any task that the community reasonably doubts most individuals can succeed in should be considered.

for the LLM, you'd give some time for people to create a best effort reasonable prompt for any given task? Is there a cap to the prompt length or time taken to prep any prompt? e.g. Let's say it's Dec 31, and somebody submits a task that the LLM doesn't seem like it can do off the bat. What happens?

If someone can create a prompt that gets the LLM to solve at least three fifths of the problems it's given, then it means the LLM can solve the task. No cap on prompt length (besides the natural context length of the LLM). No real time limit to prepare the prompt: if we have doubts, we can wait a few weeks to resolve the market. I suspect it will be clear, though, whether an LLM is superior to non-expert humans in everything or not.

Does the LLM need to achieve some level of accuracy (equal to/better than human, or a strict 5/10 attempts with any base settings configured)? The latter is much easier because the former requires sampling humans. e.g. If the LLM can solve freshly generated Easy Sudoku 50% of the time with a specific prompt, does that qualify for the task of "solve freshly generated Easy Sudoku"?

Yes. The idea is that LLMs need to reach the same level of accuracy as the humans.

In the test I described to validate a task, I expected the level of accuracy (for both parties) to be three out of five. But 50% or any other threshold would work too. Note that you can always embed a higher/lower accuracy level within the task: for instance, the task could be "solve ten out of ten easy Sudokus" or "solve at least one out of ten easy Sudokus" instead of just "solve one easy Sudoku".
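To get a feel for how the embedded threshold moves the effective bar, here's a quick binomial back-of-the-envelope (my own illustration; the 80% per-problem accuracy is an arbitrary example):

from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    # Probability of solving at least k of n independent problems,
    # each solved with per-problem accuracy p (binomial tail sum).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(p_at_least(3, 5, 0.8), 3))    # 0.942: "3 of 5" is forgiving
print(round(p_at_least(10, 10, 0.8), 3))  # 0.107: "ten out of ten" is strict
print(round(p_at_least(1, 10, 0.8), 3))   # 1.0: "at least one of ten" is nearly free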

sold Ṁ20 of NO

@Benx If we're looking at "most US adults" or "most 12-50 year olds having received formal education in an OECD country" or something similar, it would make sense to specify that in the question.

predicts NO

OK. I added "literate, aged between 12 and 70" to the description, added a new task example and a small paragraph about the nature of the tasks.

@Benx "If someone can create a prompt that gets the LLM to solve at least 3/5th of the problems it's given, then it means the LLM can solve the problem."

You can mean whatever you want for these markets, but this is not what being able to solve the problem in reality means.

E.g. the reason we cannot use a language model for many API tasks (e.g. we have tried in our company with date range interpretation) is that the success rate is around 80% at most, and we need 99% at least. And if we gave clear instructions to an average human (i.e. IQ 100 and normal skills), I am confident they would get 99% accuracy.

@DavidBolin (Also, we cannot allow chain of thought because it is too expensive in time and money. But this is a separate point.)

predicts NO

While arguably most humans would succeed

Let's try to find out!

predicts NO

Nice!

The task you described is a little different from what I had in mind for this market though:

No help may be used for the mental task of solving the Sudoku, meaning "just pen and paper allowed" in most cases, definitely no solvers, counseling, tutorials, training, strategy guides, etc. Explaining the rules is allowed.

For the purpose of this market I would allow some training for the users (e.g. 1 hour of private training with an expert), since the LLM is allowed to get some carefully crafted prompts: it's allowed to instruct the LLM with an algorithm to follow and a number of examples.

@Benx Yeah, I started with a literal reading of your question title and got curious. Ended up pretty far from analogous to what you had in mind 🤣

bought Ṁ60 of YES

I asked ChatGPT to multiply 78.1 and 63.2 and it incorrectly gave 4929.92 when it should be 4935.92. A human could do this and double-check it many times, so I am betting YES.
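(A quick exact-arithmetic check of that claim, for the record:)

from decimal import Decimal

# Decimal avoids binary floating-point rounding, so the product is exact.
print(Decimal("78.1") * Decimal("63.2"))  # 4935.92, not ChatGPT's 4929.92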

predicts YES

@Benx Do a ten digit one then. The human has unlimited time and attempts.

predicts NO

The human has unlimited time and attempts

If humans do, so do LLMs. With this market I meant to have a "fair" comparison between the reasoning capabilities of non-expert humans and LLMs.

Anyways, today's LLMs are clearly unable to solve tasks that most humans can solve. IMHO the most obvious example of this is https://manifold.markets/dreev/will-an-llm-be-able-to-solve-confus

But will that still be true in a year and a half?

@Ibozz91 I cannot do more than 2 digits times 2 digits without paper.

More related questions