Will an open-source system match or exceed Devin's 13.9% score on SWE-bench (unassisted) by EOY 2024?
I will define a system as "open-source" if:
its code (inference code, agent framework, etc.) is publicly available under an open-source license, AND
it uses a model that is reasonably available to the general public via an API (e.g. GPT-4, Claude 3 Opus, Gemini 1.5 Pro), OR
(Specifically, this means a language model API. I don't know exactly how to define this, but just using Devin via an API would certainly not count. The current OpenAI completions/chat completions API is fine. Anything that does lots of extra inference (for tree search, chain of thought, etc.) on the API side is not.)
it uses a model whose weights are available under a license allowing most personal use (e.g. the LLaMA 2 license, which is not strictly open source).
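To make the criteria above concrete, here is a rough sketch of how they combine, not an official resolution rule. The `System` class, its field names, and the two functions are hypothetical, introduced only for illustration. The 13.9 threshold is taken from the question title, and the judgment calls (what counts as a plain language model API, for instance) are reduced to boolean flags that a human would still have to decide.

```python
from dataclasses import dataclass

# Threshold as stated in the question title (assumption: the rounded figure is used).
DEVIN_SCORE_PCT = 13.9


@dataclass
class System:
    """Hypothetical record of the facts that matter for resolution."""
    code_open_source: bool      # inference code, agent framework, etc. under an open-source license
    public_lm_api: bool         # model reachable via a plain language model API, no heavy API-side inference
    weights_personal_use: bool  # model weights licensed for most personal use
    swe_bench_pct: float        # unassisted SWE-bench score, in percent


def counts_as_open_source(s: System) -> bool:
    # The code must be open AND the model must satisfy at least one of the two model conditions.
    return s.code_open_source and (s.public_lm_api or s.weights_personal_use)


def resolves_yes(s: System) -> bool:
    # "Match or exceed" means greater than or equal to the threshold.
    return counts_as_open_source(s) and s.swe_bench_pct >= DEVIN_SCORE_PCT


# Illustrative inputs only: an open agent on a public API that scores below the threshold,
# and a closed system that scores above it.
below_threshold = System(True, True, False, 12.29)
closed_system = System(False, True, False, 20.0)
print(resolves_yes(below_threshold), resolves_yes(closed_system))
```

Under this reading, a closed agent never qualifies regardless of its score, and an open agent qualifies through either model route once its score reaches the threshold.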
https://github.com/princeton-nlp/SWE-agent
Beating Devin is such a low bar. Princeton's SWE-agent has already reached 12.29% vs. Devin's 13.86%.