What will be the best score on the InterCode (Bash) benchmark before 2025?

This question will resolve as the highest accuracy achieved by an AI system on the InterCode Bash benchmark, including any post-training enhancements but excluding any human assistance. Resolution will be based on credible, publicly available results published before January 1, 2025. The primary source will be the official leaderboard, but other credible sources, including but not limited to arXiv preprints and papers, may also be considered.

Background information:

See InterCode.

InterCode is a benchmark for evaluating language models on interactive coding tasks. Given a natural language request, an agent interacts with a software system (e.g., a database or terminal) by issuing code to fulfill the request. Paper here.
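To make the interactive setup concrete, here is a minimal sketch of an agent loop and an accuracy calculation. It is not the InterCode API: the names `Task`, `ToyTerminal`, `run_episode`, `success_rate`, and `scripted_agent` are hypothetical, the "terminal" is a toy in-memory file listing, and scoring by exact match on a submitted answer is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical task: a natural language request plus the expected answer."""
    request: str
    files: Dict[str, str]  # toy stand-in for the terminal's file system
    expected: str


class ToyTerminal:
    """Toy stand-in for an interactive environment (not the real benchmark)."""

    def __init__(self, task: Task) -> None:
        self.task = task

    def step(self, command: str) -> str:
        """Execute one command and return the resulting observation."""
        parts = command.split()
        if parts == ["ls"]:
            return "\n".join(sorted(self.task.files))
        if len(parts) == 2 and parts[0] == "cat":
            return self.task.files.get(parts[1], f"cat: {parts[1]}: No such file")
        return "command not found"


# An agent maps (request, observations so far) to its next action.
Agent = Callable[[str, List[str]], str]


def run_episode(agent: Agent, task: Task, max_turns: int = 5) -> bool:
    """Let the agent interact until it submits an answer or runs out of turns."""
    env = ToyTerminal(task)
    observations: List[str] = []
    for _ in range(max_turns):
        action = agent(task.request, observations)
        if action.startswith("submit "):
            # Assumed scoring rule: exact match on the submitted answer.
            return action[len("submit "):] == task.expected
        observations.append(env.step(action))
    return False


def success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks the agent solves."""
    return sum(run_episode(agent, task) for task in tasks) / len(tasks)


def scripted_agent(request: str, observations: List[str]) -> str:
    """Hypothetical agent: list files, read the first one, submit its contents."""
    if not observations:
        return "ls"
    if len(observations) == 1:
        names = observations[0].splitlines()
        return f"cat {names[0]}" if names else "submit "
    return f"submit {observations[1]}"
```

A real agent would replace `scripted_agent` with a language model that chooses each command from the request and the observations so far. The score this question asks about corresponds to the kind of aggregate that `success_rate` computes, reported as a percentage.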

As of March 15, 2024, the best system is GPT-4-based and achieves 48.5%.

This question is part of the AI Benchmarks series by the AI Safety Student Team at Harvard, a series of questions on evaluations of AI models against technical benchmarks.

