This question will resolve as the state-of-the-art success rate (SR), with no UA Hint, achieved by an AI system on the WebArena benchmark, including any post-training enhancements but excluding any human assistance. Resolution will be based on credible publicly available results from before January 1st 2025. Credible sources include but are not limited to blog posts, arXiv preprints, and papers.
Background information:
See WebArena.
WebArena is a standalone, self-hostable web environment for building autonomous agents. WebArena introduces a benchmark on interpreting high-level, realistic natural language commands into concrete web-based interactions. Its authors provide annotated programs designed to programmatically validate the functional correctness of each task. See the paper, specifically section 5.1, for results.
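To make the resolution metric concrete, here is a minimal sketch of how a success rate could be computed from per-task validators. It is not WebArena's actual evaluation harness: the function `success_rate`, the task IDs, and the state fields are all hypothetical, and the only assumption taken from the description above is that each task has a program that checks functional correctness.

```python
from typing import Callable, Dict

# Hypothetical validator type: receives the final state an agent left behind
# and returns True if the task counts as functionally correct.
Validator = Callable[[dict], bool]


def success_rate(final_states: Dict[str, dict], validators: Dict[str, Validator]) -> float:
    """Return the percentage of tasks whose validator accepts the final state."""
    if not validators:
        raise ValueError("no tasks to evaluate")
    passed = sum(
        1
        for task_id, check in validators.items()
        # A task with no recorded final state is counted as a failure.
        if task_id in final_states and check(final_states[task_id])
    )
    return 100.0 * passed / len(validators)


# Illustrative, made-up tasks and outcomes (not real benchmark data).
validators = {
    "t1": lambda s: s.get("cart_items") == 2,
    "t2": lambda s: s.get("url", "").endswith("/orders"),
    "t3": lambda s: s.get("answer") == "42",
    "t4": lambda s: s.get("post_title") == "Hello",
}
final_states = {
    "t1": {"cart_items": 2},
    "t2": {"url": "http://shop.example/home"},
    "t3": {"answer": "41"},
}
rate = success_rate(final_states, validators)
print(f"SR = {rate:.2f}%")
```

Under this reading, SR is simply the share of benchmark tasks that pass their check, expressed as a percentage, which is the form in which scores for this question are reported.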
As of March 15th 2024, the best publicly reported score is 14.41%, achieved by a GPT-4-based agent.
Be advised that this benchmark does not yet have an official leaderboard and is not widely reported by developers. However, we hope this may change soon, given that it seems like a high-quality and important benchmark.
This question is part of the AI Benchmarks series by the AI Safety Student Team at Harvard, which covers evaluations of AI models against technical benchmarks. Full list of questions:
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the
https://manifold.markets/JonasVollmer/how-many-metr-tasks-will-be-complet
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-d38814e2aff2
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-dc351f43cd0e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-8f2bf7f44d8e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-a21d0872429b