When will AI pass @jim's "agents benchmark"?

1kṀ5331

2030

0.9%

2024

28%

2025

50%

2026

2027

2028

2029

2030

Other

Resolves to the year during which the "agents benchmark" is first solved.

The benchmark involves an AI being given the ASCII art shown below and being asked to colour each of the depicted figures in a different colour. If an AI succeeds at least half the time it is considered to have passed the benchmark. The output should be HTML or HTML and CSS.

The prompt must consist of no more than two English language sentences (along with the ASCII art itself).

The AI must have a pass rate of at least 50% for the solution to qualify.

         o           o__ __o       o__ __o__/_   o          o   ____o__ __o____    o__ __o    
        <|>         /v     v\     <|    v       <|\        <|>   /   \   /   \    /v     v\   
        / \        />       <\    < >           / \\o      / \        \o/        />       <\  
      o/   \o    o/                |            \o/ v\     \o/         |        _\o____       
     <|__ __|>  <|       _\__o__   o__/_         |   <\     |         < >            \_\__o__ 
     /       \   \\          |     |            / \    \o  / \         |                   \  
   o/         \o   \         /    <o>           \o/     v\ \o/         o         \         /  
  /v           v\   o       o      |             |       <\ |         <|          o       o   
 />             <\  <\__ __/>     / \  _\o__/_  / \        < \        / \         <\__ __/>
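
For reference, the output side of the task can be sketched as below. This is only an illustrative sketch, not part of the resolution criteria: it assumes someone has already decided which character cell belongs to which figure (the hard part, and the part the AI has to work out), and the names `render_colored`, `figure_of`, and `distinct_colors` are hypothetical.

```python
from html import escape


def distinct_colors(n):
    """Return n distinct CSS colours by stepping around the hue wheel.

    Works for n up to 360, since each hue is an integer number of degrees.
    """
    return [f"hsl({i * 360 // n}, 80%, 45%)" for i in range(n)]


def render_colored(art_lines, figure_of, palette):
    """Wrap each character that belongs to a figure in a coloured <span>.

    art_lines: list of strings, one per row of the ASCII art.
    figure_of: dict mapping (row, col) -> figure index (assumed to be given).
    palette:   list of CSS colour strings, one per figure.
    """
    out = ['<pre style="font-family: monospace">']
    for r, line in enumerate(art_lines):
        parts = []
        for c, ch in enumerate(line):
            fig = figure_of.get((r, c))
            if fig is None:
                # Whitespace or unassigned characters are emitted unchanged.
                parts.append(escape(ch))
            else:
                parts.append(
                    f'<span style="color:{palette[fig]}">{escape(ch)}</span>'
                )
        out.append("".join(parts))
    out.append("</pre>")
    return "\n".join(out)
```

Escaping matters here because the art is full of `<` and `>` characters, which a browser would otherwise try to read as tags.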

Update 2025-02-16 (PST) (AI summary of creator comment): 333 Characters Limit Update
- The prompt (excluding the ASCII art) must contain no more than 333 characters in total.
- It must still consist of no more than two English language sentences.
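
A rough way to pre-check a candidate prompt against these two limits is sketched below. The sentence counter is a naive heuristic (it splits on `.`, `!`, or `?` followed by whitespace), so treat it as an assumption rather than the creator's actual test, and it cannot detect "shenanigans". The function names are hypothetical.

```python
import re

MAX_CHARS = 333       # limit from the 2025-02-16 update
MAX_SENTENCES = 2     # limit from the original description


def count_sentences(text):
    """Naively count sentences by splitting after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def prompt_is_allowed(prompt_text):
    """Check the prompt (excluding the ASCII art) against both limits."""
    return (
        len(prompt_text) <= MAX_CHARS
        and count_sentences(prompt_text) <= MAX_SENTENCES
    )
```

Abbreviations such as "e.g." would be miscounted as sentence ends, so a borderline prompt still needs a human judgment call.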

Technology

Technical AI Timelines

AI Benchmarks

puzzles


People are also trading

In what year will go be solved?

2052

By the end of what year will the next Millennium Prize Problem be solved?

17 Comments

27 Holders

71 Trades


bought Ṁ150 YES

Claude 4 Opus (from May) is very close to passing the benchmark with no special prompting besides being told that the art spells "AGENTS". It also consistently generates this same response. As far as I can tell, its only mistake is screwing up the spacing (particularly in rows 1 and 7), which is plausibly coming from tokenization issues with long stretches of spaces rather than any lack of understanding of what characters are in each letter.

I am unable to test grok-4 (or deepseek-r1-0528, for that matter) through the arena on this benchmark because they take so long to think about the problem that the arena times out. Gemini-2.5-pro is consistently worse than Claude.

@SaviorofPlant It has come to my attention that the task is not to color each letter differently but to color each of the 34 stick figures in the letters differently.

Claude's attempt at the actual task looks like this, which is worse but not that far off:

At the same time, even one minor mistake invalidates the whole answer, so getting to 100% accuracy will be challenging.
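
One way to catch the spacing mistakes mentioned above is to strip the tags from a model's HTML and compare the remaining text with the original art row by row. The sketch below assumes the model put the art inside a `<pre>` element, which is not guaranteed, and it only checks that the characters and spacing survived, not that the colours are right. `pre_text` and `mismatched_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser


class _PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements, with entities decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def pre_text(html_source):
    parser = _PreTextExtractor()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.chunks)


def mismatched_rows(art, html_source):
    """Return 1-based row numbers where the rendered text differs from the art.

    Trailing whitespace is ignored; missing or extra rows count as mismatches.
    """
    expected = [row.rstrip() for row in art.strip("\n").split("\n")]
    actual = [row.rstrip() for row in pre_text(html_source).strip("\n").split("\n")]
    rows = [i + 1 for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
    shorter, longer = sorted((len(expected), len(actual)))
    rows.extend(range(shorter + 1, longer + 1))
    return rows
```

An empty list means every row survived intact; a result like `[1, 7]` would point at exactly the rows where the spacing drifted.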

With this prompt:

"This ASCII art spells out "AGENTS". Can you output HTML or HTML / CSS to color each letter in a different color?"

The arena version of Grok 3 gave me this:

It's just writing its own ASCII taking loose inspiration from the one in the prompt (and it butchered some of the letters), but I wonder if Grok-3-reasoning is capable of this out of the box? Anyone tried it?

@SaviorofPlant Turns out Grok-3-reasoning is free right now. It does a decent job, not perfect though

HTML: https://pastebin.com/bbwuam1Y
Full response with reasoning: https://pastebin.com/qbC4DrWp

Betting more on 2025, I think Claude 3 reasoning or whatever further improvements on this paradigm come later this year can likely solve this

reposted

Good market

The prompt must consist of no more than two English language sentences (along with the ASCII art itself).

What if the two English sentences are something like "Ignore the ASCII art below and write some HTML that will display the following. You should display 9 spaces, then a red lowercase o, then 11 spaces, then an orange lowercase o, then two orange underscores, then a space..."

Maybe you should add a character limit?

@CDBiddulph Shenanigans disallowed

@CDBiddulph 333 characters limit. And, in general, no shenanigans.

bought Ṁ150 YES

There should imo be a modification to the edges of the prompt to make it clear where the leftmost char of each row is and where the rightmost char of each row is. Otherwise even a human would have a very hard time.

opened a Ṁ5,000 NO at 50% order

@Bayesian I think the whitespace and newlines make it clear? This is what it looks like if you paste it into a text editor:

@jim Oh yeah true ig. But maybe the ai sees it like this

(This might be a self report of me being dumb)

@Bayesian The AI sees it in tokens. Something like this:

Tokens: ['\n', ' ', ' o', ' ', ' o', '__', ' ', 'o', ' ', ' o', '', ' ', 'o', '', '/_', ' ', ' o', ' ', ' o', ' ', ' ____', 'o', '__', ' ', 'o', '__', ' ', ' o', '__', ' ', 'o', ' \n', ' ', ' <|', '>', ' ', ' /', 'v', ' ', ' v', '\\', ' ', ' <|', ' ', ' v', ' ', ' <', '|\\', ' ', ' <|', '>', ' ', ' /', ' ', ' \\', ' ', ' /', ' ', ' \\', ' ', ' /', 'v', ' ', ' v', '\\', ' \n', ' ', ' /', ' \\', ' ', ' />', ' ', ' <', '\\', ' ', ' <', ' >', ' ', ' /', ' \\\\', 'o', ' ', ' /', ' \\', ' ', ' \\', 'o', '/', ' ', ' />', ' ', ' <', '\\', ' \n', ' ', ' o', '/', ' ', ' \\', 'o', ' ', ' o', '/', ' ', ' |', ' ', ' \\', 'o', '/', ' v', '\\', ' ', ' \\', 'o', '/', ' ', ' |', ' ', ' ', '\\', 'o', '_', ' \n', ' ', ' <|', '__', ' __', '|', '>', ' ', ' <|', ' ', ' ', '\\', '_', 'o', '__', ' ', ' o', '__', '/_', ' ', ' |', ' ', ' <', '\\', ' ', ' |', ' ', ' <', ' >', ' ', ' \\', '_\\', '__', 'o', '__', ' \n', ' ', ' /', ' ', ' \\', ' ', ' \\\\', ' ', ' |', ' ', ' |', ' ', ' /', ' \\', ' ', ' \\', 'o', ' ', ' /', ' \\', ' ', ' |', ' ', ' \\', ' \n', ' ', ' o', '/', ' ', ' \\', 'o', ' ', ' \\', ' ', ' /', ' ', ' <', 'o', '>', ' ', ' \\', 'o', '/', ' ', ' v', '\\', ' \\', 'o', '/', ' ', ' o', ' ', ' \\', ' ', ' /', ' \n', ' ', ' /', 'v', ' ', ' v', '\\', ' ', ' o', ' ', ' o', ' ', ' |', ' ', ' |', ' ', ' <', '\\', ' |', ' ', ' <|', ' ', ' o', ' ', ' o', ' \n', ' />', ' ', ' <', '\\', ' ', ' <', '\\', '__', ' __', '/>', ' ', ' /', ' \\', ' ', ' ', '\\', 'o', '_', '/_', ' ', ' /', ' \\', ' ', ' <', ' \\', ' ', ' /', ' \\', ' ', ' <', '\\', '__', ' __', '/>', ' \n', ' ']

It sees the whitespace and the newlines. So from that point it's just a matter of its intelligence.

edit: tbc it of course does not "see" the whitespace, but it has tokens which represent the different whitespace sizes, so a sufficiently clever LLM should be able to solve this
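
To illustrate why whitespace runs carry enough information in principle, here is a small sketch that splits a row into runs of spaces and runs of other characters and recovers the column where each non-space run starts. It does not use a real LLM tokenizer (actual tokens are split differently, as in the list above); `runs` and `column_starts` are hypothetical names.

```python
from itertools import groupby


def runs(row):
    """Split a row into alternating ("space" | "text", chunk) runs."""
    return [
        ("space" if is_space else "text", "".join(group))
        for is_space, group in groupby(row, key=str.isspace)
    ]


def column_starts(row):
    """Return (column, chunk) for every non-space run in the row."""
    col = 0
    starts = []
    for kind, chunk in runs(row):
        if kind == "text":
            starts.append((col, chunk))
        col += len(chunk)  # run lengths accumulate into column positions
    return starts
```

The point is only that column positions follow from adding up run lengths, so anything that can track those lengths can in principle line up the rows.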

@jim Could we show it an image as well? It might have good visual understanding but be bad at realigning sequential text in its head, like a human

As in show it a screenshot of the benchmark image

@Bayesian an image of the ascii art is fine

Is there specific language that the prompt must adhere to, or can it include references like "each o is a head in one of the figures"?

@PanAnon that would be fine.
