Here's a tab title from a typical crowded browser session. The title is clearly prefixed by the text "Manifold | The", but when I took a screenshot of the entire window and asked GPT-4 what tabs I had open, it labelled this one as:
Manifold Markets | x
Notice how it misread the ×-shaped close button as the letter x. It also knows that Manifold is a prediction market website with the domain manifold.markets, which led it to hallucinate "Markets" in the title.
Resolution criteria:
This market resolves YES if before 2025, any AI is able to reliably read the visible text of my browser tabs.
I'll sample my browsing history to randomly create 30 simulated browsing sessions, each with 12 tabs open. If the AI under test makes more than 1 mistake per 100 tabs (i.e. 4 or more mistakes across the 360 total tabs), it fails.
The browsing window's y-position will be randomized and one other window will be open as a distraction (though not obscuring the browser tabs). The browsing window will always have the maximum screen width of 1920 pixels.
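For concreteness, the pass/fail arithmetic above can be sketched as follows (a minimal illustration only; the `passes` function is hypothetical and not part of the market's wording):

```python
# Resolution arithmetic: 30 simulated sessions of 12 tabs each,
# with a failure threshold of more than 1 mistake per 100 tabs.

SESSIONS = 30
TABS_PER_SESSION = 12
TOTAL_TABS = SESSIONS * TABS_PER_SESSION  # 360 tabs

def passes(total_mistakes: int) -> bool:
    """The AI passes if its mistake rate is at most 1 per 100 tabs."""
    return total_mistakes / TOTAL_TABS <= 1 / 100

# With 360 tabs the cutoff is 3.6, so 3 mistakes passes and 4 fails.
print(passes(3))  # True
print(passes(4))  # False
```

So in practice the AI is allowed at most 3 total misread tabs over the whole test.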
Update 2024-12-19 (PST) (AI summary of creator comment): The following AIs will be tested:
GPT-4o
Claude 3.5 Sonnet
Gemini (latest equivalent version)
Other AIs may be tested if requested by traders.
@traders For practicality, I will just test with the most promising AIs. This will certainly include 4o and Claude 3.5 Sonnet, and whatever the equivalent is currently for Gemini (I'm not very familiar with it). If you want me to do others please mention them here.
@singer Are you sure only 1 mistake per 100 tabs is human level? But anyway, I tried it a few times with 4o and it was correct. It'll probably be more robust with common tabs than with rare / arbitrarily complex tab strings.
Prompt was:
read precisely and accurately all the characters visible in my tabs' names.
are you sure only 1 mistake per 100 tabs is human level?
I'm not. I dislike that I invoked that concept in the first place.
I've removed the vague language that wasn't doing any work in the criteria, so now it's just the procedure itself. From my perspective the interpretation is unchanged, but anyone who wants a refund can ask for one via DM. @traders
It'll probably be more robust with common tabs than with rare / arbitrarily complex tab strings.
Yes, and I have strong doubts that it will make fewer than 10 mistakes total, but I haven't tested it myself yet.
@Bayesian I just tried with my current tabs, and Claude 3.5 Sonnet fails (using the same prompt as you). The culprit was non-English text.
Roughly human level. I'd only permit 1 mistake per 100 tabs. I'm giving it the chance to do multiple chain-of-thought steps (internally it can; I'm not giving it multiple attempts per trial), so it has all the "time" a human would have to double-check their answer. This is in contrast to running something like llava-v1.6-34b directly on a screenshot, which has to output an answer immediately without reflecting on it.
edit: see the updated criteria