Will LLMs be better than typical white-collar workers on all computer tasks before 2026?

1.5kṀ35k

Dec 31

14% chance


Richard Ngo predicted on Twitter that, before 2026, LLMs (or whatever SOTA large models are at the time) will be able to beat any human on any computer task a typical white-collar worker can do in 10 minutes.

In this question, I will attempt to grade whether this prediction is true based on my best subjective judgement in Jan 2026, by asking people to come up with counterexamples that I then evaluate.

For the purposes of this question, all Manifold users will be considered by default to be "typical white-collar workers" unless it is clear to me that the Manifold user in particular has some sort of very unusual skill that I wouldn't expect >10% of Manifold users to have.

I will rely on my subjective judgement to evaluate the credibility of cases. If this question is about to resolve, I will allow 48 hours of discussion before resolving.

I will not personally be trading on this market because it relies on my subjective judgement.

Note that this prediction will likely only resolve based on publicly available large models, so it's possible that it may resolve NO even if Ngo was technically right.

Economics

LLMs

Does this include tasks which involve navigating to future-captcha-locked or otherwise bot-avoidant websites?

@JacobPfau @PeterWildeford

opened a Ṁ500 NO at 37% order

A set of 4-15 minute tasks (selected for other purposes) has a 50% completion rate on 4o/Claude3.5. That's average case, and this question measures worst-case performance. Community seems too high here; I'm at ~15% on this.

https://metr.org/blog/2024-08-06-update-on-evaluations/

Similarly:

https://openai.com/index/introducing-swe-bench-verified/

bought Ṁ250 NO

Does "any task" here mean 90% of tasks, 99%, or that Manifold has to fail to come up with any task that the AI can’t do?

opened a Ṁ1,250 YES at 15% order

New limit orders up. The requirement that we restrict to skills that <=90% of Manifolders have or could learn reasonably quickly seems to cut out a lot of the long tail of task difficulty.

Today e.g. I implemented an auto save in some column filters that we have on a table. It took probably around 5-10 minutes.

No current LLM could have done that, given the specific knowledge of the codebase that would have been required (i.e., I am also claiming that GPT-4 with plugins + uploading code could not have done it alone).

Given that I am right about my claim, would this count for your test, or is it based on the assumption that it is an easy task that will not require any knowledge of unique circumstances?

@DavidBolin That would count, yep

The definition of "computer task" is doing a lot of work here. How do you delineate? Is it "anything that you can achieve without moving away from the computer", or rather "a task that does not interact with anything outside your computer", or something else?

Some examples where definitions might conflict:

1) use e-mails to schedule a meeting with A,B and C

2) find datasets of age-based mortality in Cambodia

3) make a video call with my boss pretending I am still in my office

predicted NO

@MartinModrak Another aspect: how much prompt engineering do you allow per task? Should I be able to use basically the same instructions I would give to a human? Or is spending a couple of man-months on prompt engineering allowed for each family of tasks akin to:

@MartinModrak I think 1 + 2 of yours should count but not 3. Maybe anything you can achieve without moving away from your computer that doesn't involve faking a human identity.

predicted NO

If I had to pick between hiring the average Manifolder and unlimited GPT-4, I would probably already pick GPT-4. 5 manifolders are probably still more useful than GPT-4 though.

@Mira What kind of tasks are you hiring for? I still would have a strong preference for the average Manifolder.

Ngo's claim is that LLMs will be able to beat humans on all such tasks, but the title of this market makes it sound like it will resolve YES if there is any such task where LLMs beat humans. Can you clarify or rename?