Background
PhysBench is a roughly 10,000-item video-image-text benchmark that tests whether a vision–language model (VLM) can reason about the real-world physics governing everyday objects and scenes. It covers four domains (object properties, object relationships, scene understanding, and future-state dynamics) split into 19 fine-grained tasks such as mass comparison, collision outcomes, and fluid behaviour.
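For concreteness, the sketch below shows one way item-level results could be aggregated into per-domain and overall accuracy scores. The field names and the assumption that the overall ("ALL") score is the plain fraction of correctly answered items are illustrative only; this is not the official PhysBench evaluation code.

```python
from collections import defaultdict

def physbench_scores(results):
    """Aggregate item-level results into per-domain and overall ("ALL") accuracy.

    `results` is a list of dicts with illustrative fields:
      "domain"  -- e.g. "Property", "Relationship", "Scene", "Dynamics"
      "correct" -- True if the model's answer matched the ground truth
    Assumes the ALL score is the fraction of correctly answered items overall;
    the official protocol may aggregate per task instead.
    """
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [num correct, num total]
    for item in results:
        bucket = per_domain[item["domain"]]
        bucket[0] += int(item["correct"])
        bucket[1] += 1

    domain_acc = {d: c / t for d, (c, t) in per_domain.items()}
    total_correct = sum(c for c, _ in per_domain.values())
    total_items = sum(t for _, t in per_domain.values())
    return domain_acc, total_correct / total_items
```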
State of play:
• Human reference accuracy: 95.87 %
• Frontier AI as of Dec 2024 (InternVL 2.5‑38B): 51.94 %
Why reaching human‑level on PhysBench is a big milestone:
Physics‑consistent video generation – A model that masters all four PhysBench domains should be able to create long‑form videos, ads, or even feature films in which liquids pour, cloth folds, and shadows move as they would in the real world, eliminating the "physics mistakes" that plague today's AI‑generated video. In that sense, PhysBench is a litmus test of whether next‑generation multimodal models can move from "smart autocomplete" to physically grounded intelligence, a prerequisite for everything from autonomous robots to cinema‑quality generated film.
Resolution Criteria
This market resolves to the year bracket in which a fully automated AI system first achieves an average accuracy of 95% or higher (“human‑level”) on the PhysBench ALL metric.
Verification – Must be confirmed by a peer‑reviewed publication, an arXiv paper, or an independent leaderboard entry (e.g. LM‑Eval Harness, PapersWithCode).
Compute resources – Unrestricted.
If no AI model reaches 95% on the PhysBench ALL metric by 31 Dec 2041, the market resolves to "Not Applicable."
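As a sketch of the resolution logic described above (the function name and the (accuracy, date) input format are hypothetical, not part of the criteria):

```python
from datetime import date

THRESHOLD = 0.95              # resolution bar on the PhysBench ALL metric
DEADLINE = date(2041, 12, 31)

def resolve(first_verified_run):
    """Return the market resolution given the earliest verified qualifying run.

    `first_verified_run` is a hypothetical (all_accuracy, verification_date)
    tuple for the first confirmed result meeting the bar, or None if no such
    result exists.
    """
    if first_verified_run is None:
        return "Not Applicable"
    accuracy, when = first_verified_run
    if accuracy < THRESHOLD or when > DEADLINE:
        return "Not Applicable"
    return f"year bracket containing {when.year}"

# e.g. resolve((0.961, date(2031, 6, 1))) -> "year bracket containing 2031"
```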