Will an autonomous agent resolve 90% of tasks on SWE-bench by 2025?
Dec 31
25% chance

Resolves "Yes" if, at the time of closure, there is an entry on the SWE-bench leaderboard (https://www.swebench.com/) with a score greater than or equal to 90%.

What if there's evidence that the training data is contaminated with the SWE-Bench tasks somehow?

@DavidFWatson That's an excellent question. Let's explore possibilities:

  • This could be included in the question, i.e. what matters is only the number on the benchmark, regardless of whether it was gamed

  • I could wait a certain amount of time to see whether any controversy emerges. One month feels safe. The question would then resolve yes if, one month after the deadline, I judge that there is no consensus that the number was gamed. This would make the question more informative.
