In what year will AI achieve a score of 95% or higher on the GSO benchmark?
2025: 10%
2026: 13%
2027: 13%
2028: 13%
2029: 13%
2030: 13%
2031: 13%
2032: 13%

Background

The GSO (Global Software Optimization) benchmark tests whether large language model agents can act like real performance engineers.
It extracts 102 optimization tasks from the commit histories of 10 production codebases spanning C, C++, Rust, Go, Java, and Python.
For each task the agent receives the full repo, a correctness test suite, and a runtime profiler, then must submit a single patch that (i) keeps all tests green and (ii) reaches ≥ 95% of the speed-up achieved by the human expert's commit.
The headline metric, Opt@1, is simply the fraction of tasks cleared on the very first try; no retries, no cherry-picking. All patches are rebuilt and timed in a sandbox, so scores are reproducible and publicly displayed on the GSO leaderboard.
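For concreteness, here is a minimal sketch of how an Opt@1 score could be tallied from per-task results; the data fields and helper function below are illustrative assumptions, not GSO's actual evaluation harness.

    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        tests_pass: bool       # patched repo keeps the full correctness suite green
        agent_speedup: float   # measured speed-up of the agent's patch
        human_speedup: float   # speed-up achieved by the expert's commit

    def opt_at_1(results: list[TaskResult], threshold: float = 0.95) -> float:
        """Fraction of tasks solved on the first (and only) attempt: the tests
        pass and the patch recovers at least `threshold` of the human speed-up."""
        if not results:
            return 0.0
        solved = sum(
            1 for r in results
            if r.tests_pass and r.agent_speedup >= threshold * r.human_speedup
        )
        return solved / len(results)

    # Example: clearing 9 of the 102 tasks would give Opt@1 ≈ 8.8%.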

State of play (July 2025)

o3 (high) + OpenHands: 8.8% [Opt@1, first attempt]

Even the state-of-the-art agent clears fewer than 1 in 11 optimization tasks on its first attempt, underscoring how far current agents lag behind expert human engineers.

Why the 95% milestone matters

  • Systems-level intelligence: Unlike bug-fixing datasets (e.g., SWE-bench), GSO demands profiling, bottleneck localization, algorithmic redesign, and low-level code changes that truly speed up runtimes.

  • Real-world economic impact: Cutting CPU time by the same margin as a senior performance engineer can slash energy bills and hardware spend in data centers—hitting 95% on GSO would signal near-production readiness.

  • Clear headroom: Jumping from today's 8.8% to 95% requires a more than 10× improvement, giving a crisp yardstick to track breakthroughs in code generation, compiler reasoning, and RL-guided search.

Resolution criteria

The market resolves to the first calendar year in which all of the following hold:

  1. Score threshold – the Opt@1 column on the public GSO leaderboard shows ≥ 95% over the full official task set.

  2. Public verification – the result is confirmed by either

    • a peer-reviewed or widely cited paper (e.g., arXiv, NeurIPS, ICSE) that includes the full evaluation logs, or

    • acceptance by the GSO maintainers as an official leaderboard entry.

  3. Autonomy – after the run starts, humans may not alter the code; unlimited compute, external tools, or web search are allowed only if invoked autonomously by the agent.

  4. Expiry – if no qualifying run is verified by Jan 1, 2033, the market resolves “Not Applicable.”
