This question resolves to the state-of-the-art accuracy achieved by an AI system on the InterCode Bash benchmark, including any post-training enhancements but excluding any human assistance. Resolution will be based on credible, publicly available results published before January 1st, 2025. The primary source will be the official leaderboard, but other sources, including but not limited to arXiv preprints and papers, may also be considered.
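As an illustration only, the resolution rule can be sketched in Python as follows. The `Result` record, its field names, and the two "hypothetical" entries are assumptions made up for this example; only the 48.5% GPT-4 based figure comes from the background below, and its date here is the as-of date quoted there, not a confirmed publication date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

CUTOFF = date(2025, 1, 1)  # results must be public before this date


@dataclass
class Result:
    """Hypothetical record of one reported InterCode Bash score."""
    system: str
    accuracy: float       # percent accuracy on InterCode Bash
    published: date       # date the result became publicly available
    human_assisted: bool  # True if a human helped during evaluation
    credible: bool = True # judged credible by the market creator


def resolve(results: list[Result]) -> Optional[float]:
    """Return the best eligible accuracy, or None if nothing qualifies."""
    eligible = [
        r.accuracy
        for r in results
        if r.credible and not r.human_assisted and r.published < CUTOFF
    ]
    return max(eligible) if eligible else None


results = [
    # Figure from the background section (date is the quoted as-of date).
    Result("GPT-4 based system", 48.5, date(2024, 3, 15), False),
    # Hypothetical entries, invented to show the exclusions:
    Result("hypothetical system A", 60.0, date(2025, 2, 1), False),  # after cutoff
    Result("hypothetical system B", 70.0, date(2024, 6, 1), True),   # human-assisted
]

print(resolve(results))  # 48.5
```

In this sketch both hypothetical entries are excluded, one for being published after the cutoff and one for involving human assistance, so the value returned is the 48.5% baseline.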
Background information:
See InterCode.
InterCode is a benchmark for evaluating language models on interactive coding tasks. Given a natural language request, an agent must interact with a software system (e.g., a database or a terminal) by issuing code until the request is fulfilled. Paper here.
As of March 15th, 2024, the best system is GPT-4 based and achieves 48.5%.
This question is part of the AI Benchmarks series by the AI Safety Student Team at Harvard, which covers evaluations of AI models against technical benchmarks. Full list of questions:
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the
https://manifold.markets/JonasVollmer/how-many-metr-tasks-will-be-complet
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-d38814e2aff2
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-dc351f43cd0e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-8f2bf7f44d8e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-a21d0872429b