Benchmark Gap #2: Once we have an algorithm with human-level sample efficiency for major RL benchmarks, how many years will it be before there is an algorithm with human-level sample efficiency on essentially all AAA video game tasks?

Expected: 1.6 years

For the purposes of this question, major RL benchmarks are the Arcade Learning Environment (ALE), Minecraft, chess, Go, and StarCraft II.

Sample efficiency: the number of frames, number of games, or amount of time required to achieve a given level of performance. For this market I will use average human performance as the target: the algorithm must achieve average human performance (measured by score, Elo, time, etc.) given the same amount of data as a human.
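As a rough illustration of the definition above, the comparison could be sketched in Python like this. The `Result` type, its field names, and the assumption that a higher performance value is better are all hypothetical (for time-based metrics the comparison would need to be flipped); this is not an official resolution procedure.

```python
from dataclasses import dataclass


@dataclass
class Result:
    data_used: float    # frames, games, or hours consumed (hypothetical unit)
    performance: float  # score, Elo, etc.; assumed higher is better


def is_human_level_sample_efficient(agent: Result, human_avg: Result) -> bool:
    """Return True if the agent reaches at least average human performance
    while using no more data than the average human did."""
    return (
        agent.data_used <= human_avg.data_used
        and agent.performance >= human_avg.performance
    )


# Example: the agent used half the data and scored higher than the human average.
print(is_human_level_sample_efficient(Result(100, 50), Result(200, 40)))
```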

Video game tasks could include: maximizing score, speed runs, challenge runs, or competing against human players.

I'm restricting resolution to AAA video games to rule out edge cases such as an indie developer making a Turing test video game.

"Essentially all":

Can complete >=90% of AAA video games in <= mean human completion time
Can achieve a top-100 speedrun (according to whichever speedrun website is largest at the time) on >=90% of AAA video games given approximately the same amount of time as human speedrunners
Can complete popular challenge runs on >=90% of AAA video games
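Each of the three criteria above has the same shape: a per-game pass/fail outcome, aggregated against a 90% threshold. A minimal sketch of that aggregation follows, assuming per-game outcomes are already available as booleans. The function name and input format are hypothetical, and the sketch evaluates one criterion at a time without assuming how the three are combined.

```python
from fractions import Fraction


def meets_essentially_all(passed_per_game: list[bool],
                          threshold: Fraction = Fraction(9, 10)) -> bool:
    """Return True if the share of games passed is at least `threshold`.

    Exact fractions are used so that a result of exactly 90% is not
    affected by floating-point rounding.
    """
    if not passed_per_game:
        return False  # no games evaluated, so nothing can be claimed
    share = Fraction(sum(passed_per_game), len(passed_per_game))
    return share >= threshold


# Example: 9 of 10 games passed meets the >=90% bar; 8 of 10 does not.
print(meets_essentially_all([True] * 9 + [False]))
print(meets_essentially_all([True] * 8 + [False] * 2))
```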

The models used can include pretraining as long as the training data does not include frames from the video games. Instructions, manuals, and guides can also be used, as long as they are available to human players (e.g. the contents of a speedrunners' forum or a YouTube video explaining a trick can be part of the input).

Note: this question is about algorithms rather than models. There is no requirement that a single model be able to play multiple video games. In cases where a single model is trained to play multiple video games, I will use its average sample efficiency across all those games.
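The question does not say exactly how the average across games would be computed. One possible reading, sketched below, is to take the per-game ratio of human data to agent data and average those ratios, so that a value of 1.0 or more means the model is at least as sample efficient as humans on average. The function name, the input format, and the choice of an arithmetic mean of ratios are all assumptions for illustration.

```python
from statistics import mean


def average_relative_efficiency(data_by_game: dict[str, tuple[float, float]]) -> float:
    """Average, over games, of (human data needed) / (agent data needed).

    `data_by_game` maps a game name to (agent_data, human_data), both measured
    in the same unit (frames, games, or hours) for reaching average human
    performance on that game.
    """
    ratios = [human / agent for agent, human in data_by_game.values()]
    return mean(ratios)


# Example with made-up numbers: twice as efficient on one game,
# half as efficient on another.
example = {"game_a": (100.0, 200.0), "game_b": (400.0, 200.0)}
print(average_relative_efficiency(example))
```

Other aggregations (for instance a geometric mean, or pooling total data across games) would give different answers, so the choice matters for borderline cases.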

Technical AI Timelines
