Background
OSWorld is a benchmark for evaluating multimodal AI agents on real-world computer tasks in open-ended environments. It tests an AI's ability to navigate operating systems, use applications, and complete practical tasks through a combination of vision and text inputs/outputs.
As of January 24, 2025, the highest reported OSWorld score is 38.1, held by OpenAI CUA (200 steps). Other notable scores include:
UI-TARS-72B-DPO (50 steps): 24.6
UI-TARS-72B-DPO (15 steps): 22.7
Claude 3.5 Sonnet (50 steps): 22.0
Resolution Criteria
This market will resolve to the highest verified OSWorld score achieved by any AI model during the 2025 calendar year (January 1, 2025 to December 31, 2025). The score must be publicly reported and verifiable through official sources such as the OSWorld leaderboard, academic publications, or credible tech news outlets.
If multiple models achieve the same highest score, the market will resolve to that score. Scores reported with different decimal precisions will each be taken at their reported precision.
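The resolution rule above can be sketched in a few lines of Python. This is only an illustration, not an official resolution procedure: the `Report` record, its field names, and the sample entries are all hypothetical, and it assumes that "achieved during 2025" is judged by the date a score was publicly reported. Scores are kept as strings and compared as `Decimal` values so that each one stays at its reported precision.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional

# Resolution window stated in the criteria (inclusive on both ends).
WINDOW_START = date(2025, 1, 1)
WINDOW_END = date(2025, 12, 31)


@dataclass(frozen=True)
class Report:
    """Hypothetical record of one publicly reported OSWorld score."""
    model: str
    score: str          # kept as text to preserve the reported precision
    reported_on: date   # assumption: the report date decides eligibility
    verified: bool      # confirmed via leaderboard, paper, or credible outlet


def resolve(reports: list[Report]) -> Optional[str]:
    """Return the highest verified in-window score, or None if there is none."""
    eligible = [
        r for r in reports
        if r.verified and WINDOW_START <= r.reported_on <= WINDOW_END
    ]
    if not eligible:
        return None
    # Ties need no special handling: the market resolves to the shared score.
    best = max(eligible, key=lambda r: Decimal(r.score))
    return best.score


# Illustrative entries only; names and dates are placeholders, not real data.
reports = [
    Report("hypothetical-model-a", "38.1", date(2025, 1, 24), True),
    Report("hypothetical-model-b", "24.6", date(2025, 1, 22), True),
    Report("hypothetical-model-c", "41.0", date(2024, 12, 30), True),   # before the window
    Report("hypothetical-model-d", "45.2", date(2025, 6, 1), False),    # not verified
]

print(resolve(reports))
```

In this sketch the two higher placeholder scores are excluded, one for falling outside the calendar year and one for lacking verification, so the remaining maximum is what the market would resolve to.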