
For the purposes of this question, the major RL benchmarks are ALE, Minecraft, chess, Go, and StarCraft II.
Sample efficiency: the number of frames, games, or amount of time required to achieve a given level of performance. For this market I will use average human performance: the algorithm must reach average human performance (measured by score, Elo, completion time, etc.) given the same amount of data as the average human.
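For concreteness, here is a minimal sketch of that comparison, assuming per-game learning curves are available. The function names, frame counts, and scores are hypothetical illustrations, not part of the formal resolution criteria.

```python
def frames_to_reach(curve: list[tuple[int, float]], target: float):
    """First frame count at which the agent's score reaches `target`.

    `curve` is a list of (frames_seen, score) checkpoints in increasing
    frame order; returns None if the target is never reached.
    """
    for frames, score in curve:
        if score >= target:
            return frames
    return None

def meets_sample_efficiency(curve, avg_human_score, avg_human_frames) -> bool:
    """True if the agent reaches average human performance using no more
    data than the average human needed."""
    needed = frames_to_reach(curve, avg_human_score)
    return needed is not None and needed <= avg_human_frames

# Hypothetical example: the average human reaches a score of 8400 after
# roughly 2 million frames of play.
curve = [(500_000, 3100.0), (1_000_000, 6200.0), (1_800_000, 8700.0)]
print(meets_sample_efficiency(curve, avg_human_score=8400.0,
                              avg_human_frames=2_000_000))  # True
```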
Video game tasks could include maximizing score, speedruns, challenge runs, or competing against human players.
I'm restricting resolution to AAA video games to rule out possibilities like an indie developer making a Turing-test video game.
"Essentially all":
Can complete >=90% of AAA video games in <= mean human completion time
Can achieve a top 100 speedrun (according to whatever the largest speedrun website at the time is) on >=90% of AAA video games given approximately the same amount of time as human speed runners
Can complete popular challenge runs on >=90% of AAA video games
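The sketch below shows how those thresholds could be checked. It assumes all three criteria must hold jointly, and the game records, field names, and 90% constant are hypothetical illustrations; the real resolution would rely on whatever data is available at the time.

```python
ESSENTIALLY_ALL = 0.90  # ">= 90% of AAA video games"

def fraction_passing(games: list[dict], key: str) -> float:
    """Fraction of games whose record marks the given criterion as met."""
    return sum(1 for g in games if g.get(key)) / len(games)

# Toy records for two hypothetical AAA games.
games = [
    {"title": "Game A", "beat_mean_completion_time": True,
     "top_100_speedrun": True, "popular_challenge_runs": True},
    {"title": "Game B", "beat_mean_completion_time": True,
     "top_100_speedrun": False, "popular_challenge_runs": True},
]

resolves_yes = all(
    fraction_passing(games, key) >= ESSENTIALLY_ALL
    for key in ("beat_mean_completion_time",
                "top_100_speedrun",
                "popular_challenge_runs")
)
print(resolves_yes)  # False here: only 50% of the toy games have a top-100 speedrun
```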
The models used can include pretraining, as long as the training data does not include frames from the video games. Instructions, manuals, and guides can also be used, as long as they are available to human players (e.g., the contents of a speedrunning forum or a YouTube video explaining a trick can be part of the input).
Note: this question is about algorithms rather than models. There is no requirement that a single model be able to play multiple video games. In cases where a single model is trained to play multiple video games, I will use its average sample efficiency across all those games.
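As a worked illustration of that averaging, here is a small sketch for a single model trained on several games. Defining per-game sample efficiency as the ratio of data the agent needed to data the average human needed is my illustrative assumption, not part of the question text.

```python
def sample_efficiency_ratio(agent_frames: int, human_frames: int) -> float:
    """Data the agent needed to reach average human performance, relative to
    the data the average human needed; <= 1.0 means at least human-level
    sample efficiency on that game."""
    return agent_frames / human_frames

# Hypothetical per-game figures for one multi-game model.
per_game = {
    "Game A": sample_efficiency_ratio(agent_frames=1_500_000, human_frames=2_000_000),
    "Game B": sample_efficiency_ratio(agent_frames=3_000_000, human_frames=2_500_000),
}
average_ratio = sum(per_game.values()) / len(per_game)
print(round(average_ratio, 3))  # 0.975 -> human-level sample efficiency on average
```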