Benchmark Gap #1: Once we have a language model that achieves expert human performance on all *current* major NLP benchmarks, how many years will it be before we have an AI with human-level language skills?

First of a series of questions trying to measure the "benchmark gap": the gap between what we can measure with our benchmarks and actual human-level performance.

Current major benchmarks: GLUE, Winograd, BIG-bench, SQuAD, TriviaQA, LAMBADA, MMLU, and many more, including every benchmark in the GPT papers and in the PaLM, Minerva, Gopher, Chinchilla, Flan, LaMDA, and similar papers.

I may add additional benchmarks in the future if someone makes a convincing case that a benchmark covers something none of the major benchmarks do.
I will not add any benchmarks created/published/released after the market creation date (2022-10-27).
After 2023-10-27 I will not expand the list of benchmarks.
Until then, if there is some specific benchmark whose inclusion would change your bet, post it below and bet under the assumption that I will add it.

"Human-level language skills" is subjective and hard to define, but I will try anyway:

Note that I am not asking about a Turing test.
I am also not asking for any speaking/listening capabilities: I am only concerned with human-level reading/writing.

Some things a "human-level" language model should definitely be able to do:

- Write long-form fiction in any desired genre and format, with the ability to include particular plot elements, themes, characters, etc. (If certain kinds of fiction are forbidden or trained out of the model, that's okay.)
  - If it is legal to use AI to write long-form fiction, then that fiction should be as critically and commercially successful as human fiction (assuming no significant bias against AI-generated fiction, or that for fiction of initially unknown provenance the AI fiction does as well as the human fiction).
- Produce fiction that I personally find moving (possibly after having been finetuned on my preferences).
- Write passing essays/papers for any language-focused undergraduate course.
- Maintain a pleasant text-based conversation with a human.
  - No requirement that it be indistinguishable from a human.
- Write emails, fill out forms, schedule appointments.
- Conduct a literature review.
- Answer any basic knowledge question the average college graduate can answer.
- Generally perform any kind of written communication about as well as most humans, without necessarily perfectly imitating humans, and with an exception for scenarios where its being an AI causes bias against it (for instance, it does not have to be human-level at getting people to fall in love with it).
- Do all of the above in the top 10 most used languages on the internet.

If you feel there are important gaps in this list of capabilities feel free to make suggestions. When making bets you should assume that this list will expand over time.

Beware isolated demands for AI-genius

“Some things a "human-level" language model should definitely be able to do:

Write long-form fiction in any desired genre and format, with the ability to include particular plot elements, themes, characters, etc.”

There are zero humans even remotely capable of this.

(Or conversely, you can almost hack GPT-3 into doing this as well as the average college “graduate”)

@Gigacasting Yes. AI and humans are not the same and I am not asking for an AI that is exactly even with humans.

I will accept an AI that can be finetuned to, say, write very good plays in a particular style while also maintaining human level performance on most other things.