Background
SWE-bench Verified is a 500-task, human-vetted subset of the SWE-bench dataset that removes ambiguous or unsolvable issues. Each task corresponds to a real GitHub issue and its bug fix; success is measured solely by whether the submitted patch makes the repository's tests pass. Reaching 95% would imply an agent that can reliably read unfamiliar codebases, localise bugs, implement multi-file patches, and satisfy rigorous unit tests, approaching or surpassing strong human-engineer performance on day-to-day bug-fixing.
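For concreteness, here is a minimal sketch of how the headline number works, assuming only a per-task pass/fail list; the names below are illustrative, not the official evaluation harness:

```python
# Minimal sketch of the headline metric, assuming a list of per-task
# booleans (True = the submitted patch made that task's tests pass).
# `resolved_rate` and `task_results` are hypothetical names.

def resolved_rate(task_results: list[bool]) -> float:
    """Fraction of SWE-bench Verified tasks resolved."""
    return sum(task_results) / len(task_results)

# 95% of 500 tasks means at least 475 resolved.
task_results = [True] * 475 + [False] * 25  # hypothetical run
rate = resolved_rate(task_results)
print(f"{rate:.1%}")    # 95.0%
print(rate >= 0.95)     # True: would satisfy this market's bar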
Resolution Criteria
This market resolves to the year bracket in which a fully automated AI system first records an accuracy of 95% or higher on the SWE-bench Verified benchmark (i.e., at least 475 of the 500 tasks resolved).
Verification – The claim must be confirmed by either:
- a peer-reviewed paper or an arXiv preprint, or
- an official entry on a public SWE-bench leaderboard (e.g. the official SWE-bench website, the HAL leaderboard, or another credible source).
Compute resources – Unlimited; no restriction is placed on the compute used by the system.
Fine Print
If the resolution criteria have not been satisfied by Jan 1, 2033, the market resolves to “Not Applicable.”