Will >50% of the tasks in the WebArena benchmark be solved by EOY 2024?

15

1kṀ2350

Resolved YES on Dec 18


In this tweet (https://twitter.com/ajeya_cotra/status/1684358475416064001?s=20), Ajeya Cotra (admirably) predicted a >50% chance that >50% of the tasks in the newly announced WebArena benchmark will be solved by EOY 2024. This market resolves YES if a single agent solves >50% of the tasks. Note that Ajeya didn't specify that a single agent had to solve all of them, but I will resolve on that basis, so the market may diverge from her prediction.
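To make the possible divergence concrete, here is a minimal sketch of the two readings: the best single agent's success rate (what this market uses) versus the share of tasks solved by at least one agent. The agent names, task IDs, and task count below are hypothetical illustration data, not WebArena results.

```python
from typing import Dict, Set


def single_agent_rate(solved: Dict[str, Set[int]], total_tasks: int) -> float:
    """Best success rate achieved by any one agent on its own."""
    return max((len(tasks) for tasks in solved.values()), default=0) / total_tasks


def pooled_rate(solved: Dict[str, Set[int]], total_tasks: int) -> float:
    """Share of tasks solved by at least one agent (union over agents)."""
    union = set().union(*solved.values()) if solved else set()
    return len(union) / total_tasks


# Hypothetical data: 10 tasks, two agents solving disjoint sets of 4 tasks each.
TOTAL_TASKS = 10
solved = {
    "agent_a": {0, 1, 2, 3},
    "agent_b": {4, 5, 6, 7},
}

single = single_agent_rate(solved, TOTAL_TASKS)
pooled = pooled_rate(solved, TOTAL_TASKS)

# This market's rule: only the single-agent rate counts.
resolves_yes = single > 0.5
```

In this made-up case the pooled rate clears 50% while no single agent does, so the market would not resolve YES even though more than half of the tasks were solved by some agent.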

Technical AI Timelines




Any reason this (blog post) shouldn't qualify to resolve to "Yes"?

The official WebArena leaderboard also now shows Jace with a >50% result.

For a baseline of the current status, from the paper author's tweet thread:

"Completing such realistic tasks is challenging. Our best GPT-4 agent achieves a limited end-to-end task success rate of 10.59%."

"Understanding HTML with Large Language Models" provides some evidence that bidirectional encoder-decoder models outperform GPTs at understanding raw web page HTML, but this benchmark includes more than that:

raw web page HTML
pixel-based screenshot
accessibility tree of the webpage (this seems to be a subset of the HTML DOM tree)


