MANIFOLD
Will a frontier model score above 90% on the APEX-SWE benchmark before 2028?
41% chance

See:
https://www.mercor.com/blog/introducing-apex-swe/

Resolution Criteria

APEX-SWE is a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. The market resolves YES when any frontier AI model achieves a Pass@1 score of 90% or higher on the APEX-SWE benchmark. Resolution will be verified through official APEX-SWE leaderboard results published at https://www.mercor.com/apex/ or peer-reviewed publications documenting the result.
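
As a rough illustration of the threshold check described above, here is a minimal sketch in Python. It assumes Pass@1 is the mean, over tasks, of the fraction of independent attempts that pass. The source does not say how APEX-SWE aggregates its score, so the function names, the data layout, and the sample results are all hypothetical.

```python
from typing import Sequence


def pass_at_1(task_results: Sequence[Sequence[bool]]) -> float:
    """Mean over tasks of the fraction of attempts that passed.

    Assumption: each inner sequence holds the pass/fail outcomes of
    independent single attempts at one task. This is a common way to
    estimate Pass@1, not necessarily how APEX-SWE computes it.
    """
    if not task_results:
        raise ValueError("need at least one task")
    per_task = []
    for attempts in task_results:
        if not attempts:
            raise ValueError("each task needs at least one attempt")
        per_task.append(sum(attempts) / len(attempts))
    return sum(per_task) / len(per_task)


def meets_threshold(score: float, threshold: float = 0.90) -> bool:
    """True when the score is at or above the market's 90% bar."""
    return score >= threshold


# Hypothetical results: four tasks, one attempt each, three passes.
example_results = [[True], [True], [False], [True]]
example_score = pass_at_1(example_results)
print(example_score, meets_threshold(example_score))
```

With these made-up results the score is 0.75, which falls short of the 0.90 bar. Under the criteria above, only the official leaderboard or a peer-reviewed publication counts for resolution.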

Background

APEX-SWE assesses two novel task types that reflect real-world software engineering work: Integration tasks, which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and Observability tasks, which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context.

Considerations

Strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. This suggests that reaching 90% will require substantial advances in model reasoning capabilities beyond current scaling trends.

This description was generated by AI and edited by a pathetic meatsack.


Whether a model can score 90%+ on a benchmark depends largely on the reliability of the questions and of the system that grades the models. If the threshold were 80%, I'd be more willing to bet on this market.
