What will be the best score on the WebArena benchmark before 2025?

This question will resolve as the state-of-the-art success rate (SR), with no UA Hint, achieved on the WebArena benchmark by an AI system. Post-training enhancements are allowed; human assistance is not. Resolution will be based on credible, publicly available results published prior to January 1st, 2025. Credible sources include, but are not limited to, blog posts, arXiv preprints, and papers.
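As an illustration only, the resolution rule above can be sketched as a filter followed by a maximum. The `ReportedResult` type, its field names, and every entry marked hypothetical are assumptions made for this sketch; they are not real reported results and not an official resolution procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional

# Results must be public before this date to count (from the question text).
CUTOFF = date(2025, 1, 1)


@dataclass(frozen=True)
class ReportedResult:
    """Hypothetical record of one publicly reported WebArena result."""
    system: str
    success_rate: float          # success rate (SR) in percent
    published: date              # date the result became publicly available
    human_assisted: bool = False
    uses_ua_hint: bool = False


def resolution_value(results: Iterable[ReportedResult]) -> Optional[float]:
    """Return the best eligible SR, or None if no result qualifies."""
    eligible = [
        r.success_rate
        for r in results
        if r.published < CUTOFF      # strictly before January 1st, 2025
        and not r.human_assisted     # human assistance is excluded
        and not r.uses_ua_hint       # only the "no UA Hint" setting counts
    ]
    return max(eligible, default=None)


example_results = [
    # Baseline from the background section; the date is the "as of" date given there.
    ReportedResult("GPT-4 based agent", 14.41, date(2024, 3, 15)),
    # Everything below is made up purely to exercise the filter.
    ReportedResult("hypothetical agent A", 20.0, date(2024, 9, 1)),
    ReportedResult("hypothetical late agent", 30.0, date(2025, 1, 1)),
    ReportedResult("hypothetical assisted agent", 40.0, date(2024, 6, 1),
                   human_assisted=True),
]

print(resolution_value(example_results))
```

With these placeholder entries, the late and human-assisted results are filtered out, so the sketch resolves to the best remaining value. Whether a given source counts as credible is a judgment call that this sketch does not model.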

Background information:

See WebArena.

WebArena is a standalone, self-hostable web environment for building autonomous agents. It introduces a benchmark on translating high-level, realistic natural language commands into concrete web-based interactions. The WebArena authors provide annotated programs designed to programmatically validate the functional correctness of each task. See the paper, specifically Section 5.1, for results.

As of March 15th, 2024, the best publicly reported score is 14.41%, achieved by a GPT-4-based system.

Be advised that this benchmark does not yet have an official leaderboard and is not widely reported by developers. We hope this will change soon, since it appears to be a high-quality and important benchmark.

Part of the AI Benchmarks series by the AI Safety Student Team at Harvard on evaluations of AI models against technical benchmarks.

