See:
https://www.mercor.com/blog/introducing-apex-swe/
Resolution Criteria
APEX-SWE is a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. The market resolves YES when any frontier AI model achieves a Pass@1 score of 90% or higher on the APEX-SWE benchmark. Resolution will be verified through official APEX-SWE leaderboard results published at https://www.mercor.com/apex/ or peer-reviewed publications documenting the result.
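For readers unfamiliar with the metric: Pass@1 is the probability that a single sampled solution to a task passes evaluation, averaged over all tasks. A minimal sketch of the standard unbiased pass@k estimator (the HumanEval-style formulation; APEX-SWE's exact scoring protocol is not specified here, and the task results below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples drawn for a task,
    c of which pass, return the probability that at least one of k
    randomly chosen samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 = mean of per-task pass@1 scores.
# Hypothetical per-task results: (samples drawn, samples passing).
results = [(10, 10), (10, 9), (10, 8), (10, 0)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {score:.2%}")  # mean of 1.0, 0.9, 0.8, 0.0 -> 67.50%
```

Under this convention, a 90% threshold means the model's single-attempt success rate, averaged across all benchmark tasks, must reach 0.9.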
Background
APEX-SWE assesses two novel task types that reflect real-world software engineering work. Integration tasks require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services. Observability tasks require debugging production failures using telemetry signals such as logs and dashboards, along with unstructured context.
Considerations
Strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. This suggests that reaching 90% will require substantial advances in model reasoning capabilities beyond current scaling trends.
This description was generated by AI and edited by a pathetic meatsack
