Background
The GSO (Global Software Optimization) benchmark tests whether large-language-model agents can act like real performance engineers.
It extracts 102 optimization tasks from the commit histories of 10 production codebases spanning C, C++, Rust, Go, Java, and Python.
For each task, the agent receives the full repo, a correctness test suite, and a runtime profiler, then must submit a single patch that (i) keeps all tests green and (ii) reaches ≥95% of the speed-up achieved by the human expert commit.
The headline metric, Opt@1, is simply the fraction of tasks cleared on the very first try: no retries, no cherry-picking. All patches are rebuilt and timed in a sandbox, so scores are reproducible and publicly displayed on the GSO leaderboard.
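For concreteness, the scoring rule above can be sketched in a few lines of Python; the names below (TaskResult, task_cleared, opt_at_1) are illustrative placeholders, not the official GSO harness API.

```python
from dataclasses import dataclass

# Illustrative sketch of the scoring rule described above; all names are
# hypothetical and not part of the official GSO evaluation harness.

@dataclass
class TaskResult:
    tests_pass: bool       # (i) the full correctness suite stays green
    agent_speedup: float   # measured speedup of the agent's patch
    human_speedup: float   # speedup achieved by the human expert commit

def task_cleared(r: TaskResult, ratio: float = 0.95) -> bool:
    """A task counts as solved if all tests pass and the agent reaches
    at least 95% of the expert commit's speedup."""
    return r.tests_pass and r.agent_speedup >= ratio * r.human_speedup

def opt_at_1(first_attempts: list[TaskResult]) -> float:
    """Opt@1: the fraction of tasks cleared on the very first attempt."""
    return sum(task_cleared(r) for r in first_attempts) / len(first_attempts)
```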
State of play (July 2025)
o3 (high) + OpenHands: 8.8% [Opt@1, first attempt]
Even the state-of-the-art agent clears fewer than 1 in 11 optimization tasks on its first attempt, underscoring how far current agents lag behind expert human engineers.
Why the 95% milestone matters
Systems-level intelligence: Unlike bug-fixing datasets (e.g., SWE-bench), GSO demands profiling, bottleneck localization, algorithmic redesign, and low-level code changes that genuinely reduce runtime.
Real-world economic impact: Cutting CPU time by the same margin as a senior performance engineer can slash energy bills and hardware spend in data centers; hitting 95% on GSO would signal near-production readiness.
Clear headroom: Jumping from today’s 8.8% to 95% requires a more than tenfold improvement (95 / 8.8 ≈ 10.8), a crisp yardstick for tracking breakthroughs in code generation, compiler reasoning, and RL-guided search.
Resolution criteria
The market resolves to the first calendar year in which all of the following hold:
Score threshold – the Opt@1 column on the public GSO leaderboard shows ≥95% over the full official task set.
Public verification – the result is confirmed by either
a peer-reviewed or widely cited paper (e.g., arXiv, NeurIPS, ICSE) that includes the full evaluation logs, or
acceptance by the GSO maintainers as an official leaderboard entry.
Autonomy – after the run starts, humans may not alter the code; unlimited compute, external tools, or web search are allowed only if invoked autonomously by the agent.
Expiry – if no qualifying run is verified by Jan 1, 2033, the market resolves “Not Applicable.”
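As a rough illustration of how these conditions combine, here is a minimal sketch; the CandidateRun fields and the market_resolution function are hypothetical and not tied to the actual leaderboard schema or any official resolution tooling.

```python
from dataclasses import dataclass

# Illustrative only: field names are hypothetical, not an official API.

@dataclass
class CandidateRun:
    year: int                 # calendar year the result is publicly verified
    opt_at_1: float           # Opt@1 over the full official task set
    publicly_verified: bool   # paper with full logs, or accepted leaderboard entry
    autonomous: bool          # no human edits to the code after the run starts

def market_resolution(runs: list[CandidateRun]) -> str:
    """Return the first calendar year with a qualifying run, else "Not Applicable"."""
    qualifying = [r.year for r in runs
                  if r.opt_at_1 >= 0.95
                  and r.publicly_verified
                  and r.autonomous
                  and r.year < 2033]   # must be verified before the Jan 1, 2033 expiry
    return str(min(qualifying)) if qualifying else "Not Applicable"
```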