In this tweet (https://twitter.com/ajeya_cotra/status/1684358475416064001?s=20), Ajeya Cotra (admirably) predicted a >50% chance that >50% of the tasks in the newly announced WebArena benchmark will be solved. This market resolves on whether a single agent solves them. Note that Ajeya didn't specify that a single agent had to solve all of them, so the prediction and this market's resolution could diverge.
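To make the possible divergence concrete, here is a minimal sketch of the two readings. The looser reading, where a task counts as solved if any agent solves it, is my assumption about how the tweet could be interpreted. The function names, agent names, and task IDs are all hypothetical, and the 10-task benchmark is a toy, not WebArena data.

```python
from typing import Dict, Set

THRESHOLD = 0.5  # "more than 50% of the tasks"


def single_agent_resolves_yes(solved_by_agent: Dict[str, Set[int]], total_tasks: int) -> bool:
    """Reading used for resolution: one agent alone must solve more than half."""
    return any(len(solved) / total_tasks > THRESHOLD for solved in solved_by_agent.values())


def pooled_agents_resolve_yes(solved_by_agent: Dict[str, Set[int]], total_tasks: int) -> bool:
    """Looser reading (assumed): a task counts if any agent solves it."""
    pooled = set().union(*solved_by_agent.values())
    return len(pooled) / total_tasks > THRESHOLD


# Hypothetical results on a 10-task toy benchmark (not real WebArena numbers).
toy_results = {
    "agent_a": {0, 1, 2, 3},  # 4 of 10 tasks
    "agent_b": {4, 5, 6},     # 3 of 10 tasks
}

print(single_agent_resolves_yes(toy_results, total_tasks=10))
print(pooled_agents_resolve_yes(toy_results, total_tasks=10))
```

In this toy case neither agent clears 50% alone, but together they cover 7 of 10 tasks, so the two readings disagree.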
For a baseline of the current status, from the paper author's tweet thread:
"Completing such realistic tasks is challenging. Our best GPT-4 agent achieves a limited end-to-end task success rate of 10.59%."
"Understanding HTML with Large Language Models" provides some evidence that bidirectional encoder-decoder models outperform GPTs at understanding raw web page HTML, but this benchmark includes more than that:
raw web page HTML
pixel-based screenshots
the accessibility tree of the web page, which seems to be a subset of the HTML DOM tree