https://www.vgbench.com
From the webpage:
"tldr;
We introduce a research preview of VideoGameBench, a benchmark which challenges vision-language models to complete, in real-time, a suite of 20 different popular video games from both hand-held consoles and PC.
We also introduce VideoGameBench-Lite, a subset of the games where the environment pauses the game while the model is thinking, thereby ignoring the long inference latency bottleneck of modern vision-language models (VLMs).
Our benchmark focuses entirely on whether VLM agents can beat these games in their entirety, given only raw visual frames from the game. In this research preview, we provide code, explanations of our framework, and initial observations of our basic agent playing these games."
"It becomes apparent after running an agent on any of these games that VLM agents are not close to solving an entire game, let alone even the first level of most games."
Resolves YES when a single model completes at least 75% (15 of 20) of the games in the full version of the benchmark.
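
One way to read the resolution rule, as a minimal Python sketch. The function name, the data layout, and the model and game names are hypothetical and only for illustration. Only the 20-game total and the 75% threshold come from the text above. The sketch assumes completions are counted per model and never pooled across models.

```python
from math import ceil

TOTAL_GAMES = 20          # size of the full benchmark, per the quoted description
REQUIRED_FRACTION = 0.75  # threshold stated in the resolution criterion


def resolves_yes(completed_by_model: dict[str, set[str]]) -> bool:
    """Return True if any single model has completed enough games.

    `completed_by_model` maps a (hypothetical) model name to the set of
    games that model has fully completed.
    """
    required = ceil(REQUIRED_FRACTION * TOTAL_GAMES)  # 15 games
    return any(len(games) >= required for games in completed_by_model.values())


# Two hypothetical models that jointly cover all 20 games, 10 each.
split_results = {
    "model_a": {f"game_{i}" for i in range(10)},
    "model_b": {f"game_{i}" for i in range(10, 20)},
}
print(resolves_yes(split_results))  # False: no single model reaches 15

# One hypothetical model that completes exactly 15 games.
single_results = {"model_c": {f"game_{i}" for i in range(15)}}
print(resolves_yes(single_results))  # True
```

Under this reading, two models that each beat a different half of the suite would not trigger a YES resolution, while one model beating exactly 15 games would.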