Will AI automate GUIs by end of 2024?

Current AI agents (circa Jan 2024) are quite bad at clicking, reading screenshots, and interpreting the layout of webpages and GUIs. I expect this to change in the near future, with AI becoming capable of navigating an arbitrary GUI about as well as a human.

Example of an early system of this type: https://github.com/OthersideAI/self-operating-computer/tree/main?tab=readme-ov-file#demo

Resolution criteria (provisional):

This question resolves YES if, the day after 2024 ends, I can direct an AI agent to resolve this market as YES using only voice commands while blindfolded. It resolves NO if this takes over 30 minutes.


There are no restrictions on whether the AI agent is free, open source, proprietary, local, remote, etcetera.


If someone else on Manifold can demonstrate an AI agent resolving a Manifold market as YES (while following the same restrictions that I would have followed), then I'll resolve this one as YES too. This is in case I'm not able to get access to the AI agent myself for testing.


The agent will need to be able to open a web browser and log in to Manifold on its own.



OpenAI is developing a form of agent software to automate complex tasks by effectively taking over a customer’s device. The customer could then ask the ChatGPT agent to transfer data from a document to a spreadsheet for analysis, for instance, or to automatically fill out expense reports and enter them in accounting software. Those kinds of requests would trigger the agent to perform the clicks, cursor movements, text typing and other actions humans take as they work with different apps, according to a person with knowledge of the effort.

sold Ṁ38 of YES

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (jykoh.com)

GPT-4V has a success rate of only 16.37% on web tasks, whereas human-level performance is 88.70%. Not sure whether resolving this market is one of the easier tasks, but it seems we have a way to go before AI achieves human-level web browsing.

reposted · predicts YES

Interesting question! Might implement it later

I think this can already be done by hooking up an LLM to the macOS accessibility API.

I've also seen set-of-mark prompting used to annotate screenshots: parse the candidate elements, let the LLM choose one by its mark, then click that element's coordinates.
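The mark-to-coordinate half of that flow can be sketched in a few lines. This is a hypothetical helper, not code from any of the linked projects: the element detection and the LLM call are assumed to happen elsewhere, and the `Element` type and example buttons are made up for illustration.

```python
# Hypothetical set-of-mark helper: number the detected UI elements so an
# LLM can refer to them ("click 2"), then map a chosen mark back to a
# click point at the element's center.
from dataclasses import dataclass

@dataclass
class Element:
    label: str
    x: int  # top-left corner
    y: int
    w: int
    h: int

def annotate(elements):
    """Assign a numeric mark to each detected element."""
    return {i: el for i, el in enumerate(elements, start=1)}

def click_target(marks, chosen):
    """Center coordinates of the element behind the chosen mark."""
    el = marks[chosen]
    return (el.x + el.w // 2, el.y + el.h // 2)

# Made-up detections for illustration:
buttons = [Element("Resolve YES", 100, 200, 120, 40),
           Element("Resolve NO", 240, 200, 120, 40)]
marks = annotate(buttons)
print(click_target(marks, 1))  # -> (160, 220)
```

In a real agent, the returned coordinates would then be passed to something like an OS-level input API, and the annotated screenshot (with the mark numbers drawn on it) is what gets sent to the multimodal model.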

Might be doable with open-interpreter even: https://github.com/KillianLucas/open-interpreter/

Maybe I'll see if I can get it working then buy all the YES.

@ErikBjareholt while I expect the tech to be available soon, I'm very skeptical that any system can achieve the criteria at this exact moment. I'd love for you to prove me wrong.

bought Ṁ50 of YES

@singer You might want to take a look at:
- https://github.com/ddupont808/GPT-4V-Act
- https://github.com/reworkd/tarsier

I'm likely going to be implementing a similar system soon (first half of 2024), so unless someone beats me to it, I'll have a go at it then.

Fun resolution criteria, I like it!

bought Ṁ20 of YES

@singer Will you be buying a Rabbit R1? They claim it can do this, and if not, that you can easily teach it to.

If not, you might want to add precision, for example requiring that it can be done with free software on an ordinary computer.

@SIMOROBO Good point. Devices/services like the Rabbit R1 and the AI Pin would be eligible, and so would all premium ChatGPT-like services. Even if I don't own it, as long as someone can demonstrate it having the capability in the criteria, I'll resolve this as YES.

(I'm not planning to get an R1 but if it can really do this I'll be considering it)
