SWE-bench is a benchmark developed to evaluate whether language models can resolve real-world GitHub issues. The leaderboard showcases various models and their performance, measured as the percentage of SWE-bench instances they resolve. Each instance corresponds to a GitHub issue. The leaderboard is divided into two main categories: Unassisted and Assisted.
Assisted: In this category, models are evaluated under the "oracle" retrieval setting, which provides the model with the correct files to edit, so the benchmark primarily measures a model's patch-generation ability.
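As a rough illustration of what "oracle" retrieval means in practice, the sketch below assembles a model input from a benchmark instance: the files touched by the reference (gold) patch are read out of its diff headers and placed in the model's context, so only patch generation is being tested. The field names `problem_statement` and `patch` follow the SWE-bench dataset schema; the prompt layout and the `repo_files` mapping are illustrative assumptions, not the benchmark's actual harness.

```python
import re

def oracle_files(gold_patch: str) -> list[str]:
    """Return the paths of the files edited by the reference (gold) patch,
    parsed from the 'diff --git a/... b/...' headers of the unified diff."""
    return re.findall(r"^diff --git a/(\S+) b/", gold_patch, flags=re.M)

def build_prompt(instance: dict, repo_files: dict[str, str]) -> str:
    """Assemble an oracle-setting input: issue text plus the contents of
    exactly the files the gold patch edits (hypothetical prompt format)."""
    parts = [instance["problem_statement"]]
    for path in oracle_files(instance["patch"]):
        parts.append(f"--- {path} ---\n{repo_files[path]}")
    parts.append("Produce a unified diff that resolves the issue above.")
    return "\n\n".join(parts)
```

Because the model never has to locate the relevant files itself, scores in this setting isolate editing ability from retrieval ability.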
This question is only about the Assisted category of this benchmark.
http://www.swebench.com/#
Current state of the art (SOTA) in this category is <5%.
The prediction market will resolve based on the SWE-bench leaderboard standings as of 11th October 2024.
In the extremely unlikely case that the resolution value falls within two intervals (i.e., it lies exactly on the boundary between two answer intervals), the lower interval will be chosen.