Benchmark Gap #1: Once we have a language model that achieves expert human performance on all *current* major NLP benchmarks, how many years will it be before we have an AI with human-level language skills?

First of a series of questions trying to measure the "benchmark gap": the gap between what we can measure with our benchmarks and actual human-level performance.

Current major benchmarks: GLUE, Winograd, BIG-bench, SQuAD, TriviaQA, LAMBADA, MMLU, and many more. This includes every benchmark used in the GPT, PaLM, Minerva, Gopher, Chinchilla, Flan, and LaMDA papers, among others.

  • I may add further benchmarks in the future if someone makes a convincing case that a benchmark covers something none of the major benchmarks do.

  • I will not add any benchmarks created/published/released after the market creation date (2022-10-27).

  • After 2023-10-27 I will not expand the list of benchmarks.

  • Until then, if there is a specific benchmark whose inclusion would change your bet, post it below and bet under the assumption that I will add it.

"Human-level language skills" is subjective and hard to define, but I will try anyway:

  • Note that I am not asking about a Turing test.

  • I am also not asking for any speaking/listening capabilities: I am only concerned with human-level reading/writing.

Some things a "human-level" language model should definitely be able to do:

  • Write long-form fiction in any desired genre and format, with the ability to include particular plot elements, themes, characters, etc. (It's okay if certain kinds of fiction are forbidden or trained out of the model.)

    • If it is legal to use AI to write long-form fiction, then that fiction should be as critically and commercially successful as human fiction (assuming no significant bias against AI-generated fiction, or, failing that, AI fiction of initially unknown provenance should do as well as human fiction).

  • Produce fiction that I personally find moving (possibly after having been finetuned on my preferences)

  • Write passing essays/papers for any language-focused undergraduate course

  • Maintain a pleasant text-based conversation with a human.

    • No requirement that it be indistinguishable from a human

  • Write emails, fill out forms, schedule appointments.

  • Conduct a literature review

  • Answer any basic knowledge question the average college graduate can answer

  • Generally perform any kind of written communication about as well as most humans, without necessarily imitating humans perfectly, and with an exception for scenarios where being an AI causes bias against it (for instance, it does not have to be human-level at getting people to fall in love with it)

  • Do all of the above in the top 10 most used languages on the internet

If you feel there are important gaps in this list of capabilities, feel free to make suggestions. When making bets, you should assume that this list will expand over time.


Beware isolated demands for AI-genius

“Some things a "human-level" language model should definitely be able to do:

  • Write long-form fiction in any desired genre and format, with the ability to include particular plot elements, themes, characters, etc.”

There are zero humans even remotely capable of this.

(Or conversely, you can almost hack GPT-3 into doing this as well as the average college “graduate”)

@Gigacasting Yes. AI and humans are not the same and I am not asking for an AI that is exactly even with humans.

I will accept an AI that can be finetuned to, say, write very good plays in a particular style while also maintaining human-level performance on most other things.
