This question will resolve to the highest number of METR ARA tasks (out of 12) fully completed by an AI system, not counting partial completions. Post-training enhancements are allowed, but human assistance is not. Resolution will be based on credible, publicly available results from before January 1st, 2025. “Credible results” primarily means, but is not limited to, reports or posts by METR themselves.
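The resolution rule above can be sketched in code. This is only an illustration of how the criteria combine, not an official resolution script: the `Report` fields, the `resolve` function, and the two entries marked hypothetical are assumptions made up for the example. Only the GPT-4 figure of 4/12 comes from this description.

```python
from dataclasses import dataclass
from datetime import date

TOTAL_TASKS = 12            # size of the METR ARA task set referenced above
CUTOFF = date(2025, 1, 1)   # results must be from before this date


@dataclass
class Report:
    system: str
    published: date
    fully_completed: int    # partial completions are not counted here
    human_assisted: bool
    credible: bool          # a judgment call, e.g. a report by METR


def resolve(reports):
    """Return the best eligible score, or None if nothing qualifies."""
    eligible = [
        r.fully_completed
        for r in reports
        if r.credible and not r.human_assisted and r.published < CUTOFF
    ]
    return max(eligible, default=None)


example_reports = [
    # From the background information: GPT-4 completed 4/12 tasks.
    Report("GPT-4", date(2024, 3, 15), 4, human_assisted=False, credible=True),
    # Hypothetical: excluded because a human assisted the system.
    Report("hypothetical-A", date(2024, 9, 1), 7, human_assisted=True, credible=True),
    # Hypothetical: excluded because it falls after the cutoff.
    Report("hypothetical-B", date(2025, 1, 2), 9, human_assisted=False, credible=True),
]

print(f"Resolves to {resolve(example_reports)}/{TOTAL_TASKS}")
```

With these illustrative entries, only the GPT-4 result is eligible, so the sketch resolves to 4.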
Background information:
See METR Tasks.
As of March 15th, 2024, the best result is from GPT-4, which completed 4/12 tasks.
Be advised that this benchmark does not yet have an official leaderboard and is not widely reported by developers. However, we hope this will change soon and that METR will evaluate new models on these same tasks.
This question is part of the AI Benchmarks series by the AI Safety Student Team at Harvard, which covers evaluations of AI models against technical benchmarks. Full list of questions:
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the
https://manifold.markets/JonasVollmer/how-many-metr-tasks-will-be-complet
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-d38814e2aff2
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-dc351f43cd0e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-8f2bf7f44d8e
https://manifold.markets/JonasVollmer/what-will-be-the-best-score-on-the-a21d0872429b