See:
https://www.mercor.com/blog/introducing-apex-swe/
Resolution Criteria
APEX-SWE is a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. The market resolves YES when any frontier AI model achieves a Pass@1 score of 90% or higher on the APEX-SWE benchmark. Resolution will be verified through official APEX-SWE leaderboard results published at https://www.mercor.com/apex/ or peer-reviewed publications documenting the result.
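For readers unfamiliar with the metric: Pass@1 is the probability that a single sampled solution to a task passes evaluation, averaged over all tasks. A minimal sketch of the standard unbiased pass@k estimator (the HumanEval-style formulation; APEX-SWE's exact scoring protocol is not specified here, and the task results below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples drawn for a task,
    c of which pass, return the probability that at least one of k
    randomly chosen samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 = mean of per-task pass@1 scores.
# Hypothetical per-task results: (samples drawn, samples passing).
results = [(10, 10), (10, 9), (10, 8), (10, 0)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {score:.2%}")  # mean of 1.0, 0.9, 0.8, 0.0 -> 67.50%
```

Under this convention, a 90% threshold means the model's single-attempt success rate, averaged across all benchmark tasks, must reach 0.9.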
Background
APEX-SWE assesses two novel task types that reflect real-world software engineering work. Integration tasks require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services. Observability tasks require debugging production failures using telemetry signals such as logs and dashboards, along with unstructured context.
Considerations
Strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. This suggests that reaching 90% will require substantial advances in model reasoning capabilities beyond current scaling trends.
This description was generated by AI and edited by a pathetic meatsack
