Will AI correctly see what tabs I have open before 2025?
55% chance

Here's a tab title in a typical crowded browser session. The title is clearly prefixed by the text "Manifold | The", but when I took a screenshot of the entire window and asked GPT-4 what tabs I had open, it labelled that tab as:

Manifold Markets | x

Notice how it misread the x-shaped close button as the letter x. It also knows that Manifold is a prediction-market website with the domain manifold.markets, which led it to hallucinate "Markets" into the title.

Resolution criteria (provisional):

This market resolves YES if, before 2025, an AI is able to reliably read the visible text of my browser tabs given an arbitrary screenshot with fewer than 15 open tabs (I've settled on using 12 tabs for testing). Other windows may be open as well, and the browser window doesn't necessarily need to be at the top of the screen.

An OCR tool that outputs all of the text in the screenshot isn't enough. It needs to specifically answer the question "What is the title of each browser tab I have open?"
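To score an answer to that question, the model's reply would need to be parsed into an ordered list of tab titles rather than a raw text dump. A minimal sketch of such a parser (the function name and the accepted line formats are my own assumptions, not part of the market's setup):

```python
import re

def parse_tab_titles(reply: str) -> list[str]:
    """Parse a model reply into an ordered list of tab titles.

    Accepts numbered lines ("1. Title", "2) Title"), bulleted lines
    ("- Title", "* Title"), or plain one-title-per-line output.
    Blank lines are ignored.
    """
    titles = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        # Strip a leading "1." / "1)" / "-" / "*" marker, if present.
        line = re.sub(r"^(\d+[.)]|[-*])\s*", "", line)
        titles.append(line)
    return titles
```

The point of normalizing to a list is that "answered the question" becomes checkable per tab, which the mistake-counting criteria below rely on.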

Update:

The required accuracy/consistency should be roughly human level. I'll sample my browsing history to randomly create 30 simulated browsing sessions, each with 12 tabs open (360 tabs total). If it makes more than 1 mistake per 100 tabs (i.e. more than 3 mistakes across the 360), it fails and this market resolves NO.

The browsing window's y-position will be randomized and one other window will be open as a distraction (though not obscuring the browser tabs). The browsing window will always have the maximum screen width of 1920 pixels.

If this setup doesn't make sense, please leave a comment. I'll try to publish a baseline of current GPT-4 performance on this setup before Q2 of this year.


Note (I'm adding this comment to a few of my markets): I was hoping to do regular early tests of this but it's too far back on my backlog right now. I'm still committing to resolving this properly at the end of the year, however.

How accurate/consistent does it have to be?

Roughly human level: I'd only permit 1 mistake per 100 tabs. I'm giving it the chance to do multiple chain-of-thought steps internally (though not multiple attempts per trial), so it has all the "time" a human would have to double-check their answer. This is in contrast to running something like llava-v1.6-34b directly on a screenshot, which has to output an answer immediately without reflecting on it.

edit: see the updated criteria

opened a Ṁ1,000 NO at 50% order

@singer thanks!