When will AI pass @jim's "agents benchmark"?
25
1kṀ4652
2030
2024: 1%
2025: 40%
2026: 33%
2027: 12%
2028: 4%
2029: 3%
2030: 3%
Other: 4%

Resolves to the year during which the "agents benchmark" is first solved.

The benchmark involves an AI being given the ASCII art shown below and being asked to colour each of the depicted figures in a different colour. If an AI succeeds at least half the time it is considered to have passed the benchmark. The output should be HTML or HTML and CSS.

The prompt must consist of no more than two English language sentences (along with the ASCII art itself).

The AI must have a pass rate of at least 50% for the solution to qualify.

         o           o__ __o       o__ __o__/_   o          o   ____o__ __o____    o__ __o    
        <|>         /v     v\     <|    v       <|\        <|>   /   \   /   \    /v     v\   
        / \        />       <\    < >           / \\o      / \        \o/        />       <\  
      o/   \o    o/                |            \o/ v\     \o/         |        _\o____       
     <|__ __|>  <|       _\__o__   o__/_         |   <\     |         < >            \_\__o__ 
     /       \   \\          |     |            / \    \o  / \         |                   \  
   o/         \o   \         /    <o>           \o/     v\ \o/         o         \         /  
  /v           v\   o       o      |             |       <\ |         <|          o       o   
 />             <\  <\__ __/>     / \  _\o__/_  / \        < \        / \         <\__ __/>  

  • Update 2025-02-16 (PST) (AI summary of creator comment): 333 Characters Limit Update

    • The prompt (excluding the ASCII art) must contain no more than 333 characters in total.

    • It must still consist of no more than two English language sentences.
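
The rules above (at most 333 characters, at most two sentences, pass rate of at least 50%) can be sketched as a quick check. This is only an illustration, not the creator's resolution procedure: the function names are hypothetical, and the sentence counter is a naive split on terminal punctuation that would miscount abbreviations such as "e.g.".

```python
import re

MAX_CHARS = 333      # limit from the 2025-02-16 update (ASCII art excluded)
MAX_SENTENCES = 2    # limit from the original description


def count_sentences(text):
    """Naive count: non-empty chunks between runs of '.', '!' or '?'."""
    pieces = re.split(r"[.!?]+", text)
    return sum(1 for piece in pieces if piece.strip())


def prompt_is_allowed(prompt):
    """Check the prompt text (without the ASCII art) against both limits."""
    return len(prompt) <= MAX_CHARS and count_sentences(prompt) <= MAX_SENTENCES


def passes_benchmark(successes, trials):
    """A pass rate of at least 50% qualifies; zero trials never qualifies."""
    return trials > 0 and 2 * successes >= trials


sample = ('This ASCII art spells out "AGENTS". Can you output HTML or '
          'HTML / CSS to color each letter in a different color?')
print(prompt_is_allowed(sample), passes_benchmark(5, 10))
```

Whether a borderline prompt counts as two sentences, or as a shenanigan, remains the creator's call; the check only covers the mechanical limits.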


With this prompt:

"This ASCII art spells out "AGENTS". Can you output HTML or HTML / CSS to color each letter in a different color?"

The arena version of Grok 3 gave me this:

It's just writing its own ASCII taking loose inspiration from the one in the prompt (and it butchered some of the letters), but I wonder if Grok-3-reasoning is capable of this out of the box? Anyone tried it?

@SaviorofPlant Turns out Grok-3-reasoning is free right now. It does a decent job, not perfect though

HTML: https://pastebin.com/bbwuam1Y
Full response with reasoning: https://pastebin.com/qbC4DrWp

Betting more on 2025, I think Claude 3 reasoning or whatever further improvements on this paradigm come later this year can likely solve this


Good market

The prompt must consist of no more than two English language sentences (along with the ASCII art itself).

What if the two English sentences are something like "Ignore the ASCII art below and write some HTML that will display the following. You should display 9 spaces, then a red lowercase o, then 11 spaces, then an orange lowercase o, then two orange underscores, then a space..."

Maybe you should add a character limit?

@CDBiddulph Shenanigans disallowed

@CDBiddulph 333 characters limit. And, in general, no shenanigans.


There should imo be a modification to the edges of the prompt to make it clear where the leftmost char of each row is and where the rightmost char of each row is. Otherwise even a human would have a very hard time.


@Bayesian I think the whitespace and newlines make it clear? This is what it looks like if you paste it into a text editor:

@jim Oh yeah true ig. But maybe the ai sees it like this

(This might be a self report of me being dumb)

@Bayesian The AI sees it in tokens. Something like this:

Tokens: ['\n', ' ', ' o', ' ', ' o', '__', ' ', 'o', ' ', ' o', '', ' ', 'o', '', '/_', ' ', ' o', ' ', ' o', ' ', ' ____', 'o', '__', ' ', 'o', '__', ' ', ' o', '__', ' ', 'o', ' \n', ' ', ' <|', '>', ' ', ' /', 'v', ' ', ' v', '\\', ' ', ' <|', ' ', ' v', ' ', ' <', '|\\', ' ', ' <|', '>', ' ', ' /', ' ', ' \\', ' ', ' /', ' ', ' \\', ' ', ' /', 'v', ' ', ' v', '\\', ' \n', ' ', ' /', ' \\', ' ', ' />', ' ', ' <', '\\', ' ', ' <', ' >', ' ', ' /', ' \\\\', 'o', ' ', ' /', ' \\', ' ', ' \\', 'o', '/', ' ', ' />', ' ', ' <', '\\', ' \n', ' ', ' o', '/', ' ', ' \\', 'o', ' ', ' o', '/', ' ', ' |', ' ', ' \\', 'o', '/', ' v', '\\', ' ', ' \\', 'o', '/', ' ', ' |', ' ', ' ', '\\', 'o', '_', ' \n', ' ', ' <|', '__', ' __', '|', '>', ' ', ' <|', ' ', ' ', '\\', '_', 'o', '__', ' ', ' o', '__', '/_', ' ', ' |', ' ', ' <', '\\', ' ', ' |', ' ', ' <', ' >', ' ', ' \\', '_\\', '__', 'o', '__', ' \n', ' ', ' /', ' ', ' \\', ' ', ' \\\\', ' ', ' |', ' ', ' |', ' ', ' /', ' \\', ' ', ' \\', 'o', ' ', ' /', ' \\', ' ', ' |', ' ', ' \\', ' \n', ' ', ' o', '/', ' ', ' \\', 'o', ' ', ' \\', ' ', ' /', ' ', ' <', 'o', '>', ' ', ' \\', 'o', '/', ' ', ' v', '\\', ' \\', 'o', '/', ' ', ' o', ' ', ' \\', ' ', ' /', ' \n', ' ', ' /', 'v', ' ', ' v', '\\', ' ', ' o', ' ', ' o', ' ', ' |', ' ', ' |', ' ', ' <', '\\', ' |', ' ', ' <|', ' ', ' o', ' ', ' o', ' \n', ' />', ' ', ' <', '\\', ' ', ' <', '\\', '__', ' __', '/>', ' ', ' /', ' \\', ' ', ' ', '\\', 'o', '_', '/_', ' ', ' /', ' \\', ' ', ' <', ' \\', ' ', ' /', ' \\', ' ', ' <', '\\', '__', ' __', '/>', ' \n', ' ']

It sees the whitespace and the newlines. So from that point it's just a matter of its intelligence.

edit: tbc it of course does not "see" the whitespace, but it has tokens which represent the different whitespace sizes, so a sufficiently clever LLM should be able to solve this
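
To make concrete what a passing output might look like once the alignment is recovered, here is a minimal sketch. It assumes the figures can be separated by column ranges, which has not been checked against the real art; the `boundaries` values and the toy two-figure art are hypothetical, and this is not how any of the models mentioned here produce their answers.

```python
from html import escape


def colour_by_columns(art, boundaries, colours):
    """Wrap each column band of every row in a coloured <span>.

    boundaries: hypothetical start column of each figure, in ascending order.
    colours:    one CSS colour per figure.
    """
    rows = art.split("\n")
    width = max(len(row) for row in rows)
    edges = list(boundaries) + [width]
    out = []
    for row in rows:
        row = row.ljust(width)  # pad so every row has the same width
        parts = [escape(row[:edges[0]])]  # text left of the first figure stays uncoloured
        for colour, start, end in zip(colours, edges, edges[1:]):
            parts.append(
                f'<span style="color:{colour}">{escape(row[start:end])}</span>'
            )
        out.append("".join(parts))
    return "<pre>\n" + "\n".join(out) + "\n</pre>"


# Toy art with two stick figures; the split at column 4 is chosen by hand.
toy_art = " o   o\n/|\\ /|\\"
print(colour_by_columns(toy_art, [0, 4], ["red", "blue"]))
```

The `<pre>` wrapper keeps the whitespace intact in the browser, and `escape` protects characters such as `<` and `>` that the art uses heavily.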

@jim Could we show it an image as well? It might have good visual understanding but be bad at realigning sequential text in its head, like a human

As in show it a screenshot of the benchmark image

@Bayesian an image of the ascii art is fine

Is there specific language that the prompt must adhere to, or can it include references like "each o is a head in one of the figures"?

@PanAnon that would be fine.


© Manifold Markets, Inc.