Top OSWorld score in 2025?

resolved Jan 2

Resolved: ~65.0%

Above 50%: Resolved YES
Above 60%: Resolved YES
Above 70%: Resolved NO
Above 80%: Resolved NO
Above 90%: Resolved NO

Background

OSWorld is a benchmark for evaluating multimodal AI agents on real-world computer tasks in open-ended environments. It tests an AI's ability to navigate operating systems, use applications, and complete practical tasks through a combination of vision and text inputs/outputs.

As of January 24, 2025, the highest OSWorld score is held by OpenAI CUA (200 steps) with a score of 38.1. Other notable scores include:

UI-TARS-72B-DPO (50 steps): 24.6
UI-TARS-72B-DPO (15 steps): 22.7
Claude 3.5 Sonnet (50 steps): 22.0

Resolution Criteria

This market will resolve to the highest verified OSWorld score achieved by any AI model during the 2025 calendar year (January 1, 2025 to December 31, 2025). The score must be publicly reported and verifiable through official sources such as the OSWorld leaderboard, academic publications, or credible tech news outlets.

If multiple models achieve the same highest score, the market will resolve to that score. If scores are reported with different decimal precisions, they will be considered at their reported precision.
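As a rough illustration of how these criteria map onto the threshold options, here is a minimal Python sketch. The function names are hypothetical, and the strict "greater than" comparison is an assumption based on the word "Above"; the market text does not say how a score exactly equal to a threshold would be handled. The input scores are the ones listed in the Background section, and 65.0 is the approximate resolved value shown on this page.

```python
from typing import Dict, Iterable, List

# Thresholds taken from the market's answer options (in percent).
THRESHOLDS: List[float] = [50.0, 60.0, 70.0, 80.0, 90.0]


def top_score(scores: Iterable[float]) -> float:
    """Return the highest reported score; tied models resolve to that same value."""
    return max(scores)


def resolve_thresholds(score: float,
                       thresholds: Iterable[float] = THRESHOLDS) -> Dict[str, str]:
    """Map each 'Above X%' option to YES or NO.

    Assumption: 'Above' is read as a strict comparison (score > threshold).
    """
    return {f"Above {t:g}%": ("YES" if score > t else "NO") for t in thresholds}


# Scores listed in the Background section (as of January 24, 2025).
background_scores = {
    "OpenAI CUA (200 steps)": 38.1,
    "UI-TARS-72B-DPO (50 steps)": 24.6,
    "UI-TARS-72B-DPO (15 steps)": 22.7,
    "Claude 3.5 Sonnet (50 steps)": 22.0,
}

if __name__ == "__main__":
    print(top_score(background_scores.values()))
    print(resolve_thresholds(65.0))
```

With the January 2025 scores, none of the thresholds would have been cleared; with the resolved value of roughly 65.0, only the 50% and 60% options come out YES under this reading.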

Market context

Technical AI Timelines

People are also trading

What will be the best score (almost resolved) on ProgramBench at the end of 2026?

Will a non-Windows, non-Unix-like operating system capture greater than 3% market share by 2043?

Will a non-Windows, non-Unix-like operating system capture greater than 3% market share by 2033?

What will be the highest Epoch Capabilities Index score in 2026?

Will 2026 be the year of the Linux desktop?

What will be the best GSOBench score by Dec 31, 2026?

Top score on Humanity's Last Exam > 50% by 2029?

Microsoft ships a Linux desktop by 2032?

Comments

resolve no for all

@prismatic huh? It's 72.6% if you click "all". Resolution doesn't say limited to foundation e2e

@Usaar33 it says "any AI model" and the 72.6% score is from an AI system, not AI model by the conventional meaning of those terms

Will OSWorld Verified count toward this market? Seems that it's what labs are using now:

https://x.com/justjoshinyou13/status/1972721166277111948

@SG 50% and 60% resolve YES if you do count OSWorld Verified (and i def think you should)
