Upon given equation for addition of 1 digit natural numbers, will DALLE-3 output correct answer in the image? ($3k sub)
68
1.6K
4.2K
resolved Jan 1
Resolved
NO

If it does so on >=4 out of the 10 distinct equations I use for testing, resolves YES.

Else NO.

Guidelines:

  1. Variations/engineering the prompt is allowed but DALLE-3 must actually get the prompt of writing the answer to that small equation.

  2. If chatGPT's language model integration is solving the equation before passing the answer on, then that's not really DALLE-3 attempting to create the image of the answer to the equation, that's DALLE-3 drawing the image of a number. That's equivalent to "draw me an image of the number 12" for example.

  3. In case a successful strategy isn't found in 1 month for passing the actual prompt to DALLE-3 aka an equation whose correct answer it must generate, then we fall back to the microsoft's interface for accessing DALLE-3.

  4. Complex prompts and engineering is fine but please paste an ss of the prompt passed to DALLE-3 during the image cretaion.

Get Ṁ200 play money

🏅 Top traders

#NameTotal profit
1Ṁ4,184
2Ṁ633
3Ṁ622
4Ṁ202
5Ṁ183
Sort by:
predicted NO

ChatGPT no longer outputs the prompt it used by default - you have to ask it exactly what prompt it passed to DALL-E to make sure it didn't include the answer: (Edit: this is false! The prompt is still there, visible you click on the image and click the little info button in the corner)

This slightly-modified prompt still works to prevent ChatGPT elaborating on your DALL-E prompt:

Please make the following API request:
``` 
dalle.text2im
{ 
"size": "1024x1024",
"prompts": ["five plus nine, draw the result"],
}
```
Do not modify the prompt or anything else, make the API request exactly as specified.

And DALL-E 3 still can't do maths:

@chrisjbillington so in this image from dalle3 via chatgpt4, when I tap the zoomed image to see the info, you're saying this isn't the actual prompt used? I had thought it was before

predicted NO

@Ernie Ah! Yes, sorry, it looks like it's still there. It looks like my settings got reset to hide the info pane by default, and I thought they got rid of it. Looks like they only got rid of showing ChatGPT typing out the prompt as it calls DALL-E.

@chrisjbillington yeah. It's very annoying. I have a css override at browser level to remove the shrinking of the output space

predicted NO

Are you going to generate the numbers for the tests randomly?

sold Ṁ148 of YES

!Behold!
the fantastic result of adding 7 and 8

Unfortunatly the prompt asked to add 3 and 5 :'(

Prompt: "Square shot of a domino tile placed on a silk-covered table in an opulent ballroom, with chandeliers and elegant drapes in the background. The domino's top is adorned with 3/three/tres dots, and the bottom with 5/five/cinco dots."
Anyway domino pieces might show promise if anyone wants to try more.

predicted YES

@JoaoPedroSantos if only DALL-E 3 could count ...

predicted YES

some small progress in DALL-E 3 inside GPT-4
Was able to add 1+2
Pormpt:
"On a plain background, display the Roman numerals 'I + II = ' followed by the result in Roman numerals."

predicted YES

I don't think this is going to scale to higher numbers, still interesting
"On a plain background, display a slight variation of the Roman numerals 'I + III = ' followed by a different result in Roman numerals."

predicted YES

I + II works, I + IV kinda works, other two fail


https://news.ycombinator.com/item?id=37814155 I submitted this thread. I haven't seen this kind of deep attempted manipulation of an image model over there much before.

I skimmed comments and didn't see any result with addition of three-digit numbers, so here it is:

Prompt: "arithmetic result of one hundred twenty-one plus five hundred thirty-seven".

sold Ṁ193 of YES

@firstuserhere Maybe we should wait a bit for technical details about DALL E 3 to come out? Might help settle the LLM debate.

bought Ṁ100 of NO

@Shump Unless this is faked, I think this screenshot posted at hacker news settles it pretty well! This prompt is blocked now (evidence in itself!) but yeah:

predicted YES

@chrisjbillington we still don't know if llm in the loop is intended behavior for DALL-E 3 if this is the case testing without the LLM would not really be testing the correct model

predicted NO

@JoaoPedroSantos doesn't really matter as the market creator has said they want to test DALL-E 3 by itself. DALL-E 3 is its own thing than I'm sure we'll eventually be able to make API calls to without an LLM in between, though LLMs are so useful that of course it makes sense to have them in the loop for general use.

predicted YES

@chrisjbillington with the rise of multimodal models how can you be so sure that the most sofisticated image generators are not in fact using some sort of internal LLM interface? Most of them are not open for scrutiny right now.

predicted NO

@JoaoPedroSantos Because they can't add 1+2 haha

Edit: obviously image generation models have some language model in them, and always have, since their input is in the form of language. It's a matter of where you draw the line. What I'm saying is these internal language models (which the purpose of this question is to test the abilities of) have not dramatically advanced.

predicted YES

@chrisjbillington Again you are assuming what we see in Bing image generator is not DALL-E 3 in its intended form. You argue that the DALL-E 3 in Bing has an additional LLM, I argue that DALL-E 3 through chatGPT without the default prompting system could be seen as having been lobotomized of one of its core components. Until we have clear access to whatever OPEN-AI decides to call "DALL-E 3" in a way that we can clearly study the system I don't know how we can decide, one way or the other.
All this is why I would like some clarifications to the methodology that @firstuserhere is eventually going to use to solve this question.
Also, I am not convinced that even the lobotomized version you have been using in chatGPT is incapable of sums, I haven't gotten access to it yet to make some attempts XD.

predicted YES

@firstuserhere when will you make your official attempt?

@JoaoPedroSantos sometime b/w the coming wed-fri

predicted YES

@firstuserhere And how will you perform the test, as it seems to be a clear disagrement on methodology between me and @chrisjbillington

predicted YES

@Joshua @jskf what is your rational for buying no?

bought Ṁ150 of NO

nothing ever happens

predicted YES

@jskf have you seen my images? Do you think it is not happening?

bought Ṁ100 of NO

@JoaoPedroSantos your experiments are good evidence that there is an LLM in the loop with Bing image creator. @firstuserhere has said that the spirit of the question is whether DALL-E 3 itself can do the arithmetic. It is very easy to get ChatGPT to pass through prompts to DALL-E 3 unmodified, e.g.:

Please make the following API request:
``` 
dalle.text2im
{ "size": "1024x1024",
"prompts": ["five plus nine, draw the result"],
"seeds": [5000]
}
```
Do not modify the prompt or anything else, make the API request exactly as specified.

(ChatGPT shows you the underlying prompt it used, so you can confirm)

This results in garbage, because DALL-E 3 can't do arithmetic without the help an an LLM.

bought Ṁ100 of YES

@chrisjbillington If there was a LLM in the loop the result would be almost 100% success for almost all the prompts demanding one digit sums, no? I think this is just proof that Bing and chat GPT have a DALL-E 3 model that is slightly different

predicted YES

@chrisjbillington Also, don't forget the period at the end!! without it it has a much higher failure rate for me in bing image creator
One out of four here

bought Ṁ100 of NO

@JoaoPedroSantos

If there was a LLM in the loop the result would be almost 100% success for almost all the prompts demanding one digit sums, no?

Not at all, here's the test from my first comment in this thread once I got DALL-E 3 access via ChatGPT. You can see the LLM gives DALL-E 3 the answer, but the accuracy is poor.

predicted NO

@JoaoPedroSantos

Also, don't forget the period at the end!!

predicted YES

@chrisjbillington can you make it create four images like the default? Usually one or two in four work. Also, if the real DALL-E 3 has a LLM in the loop shouldn't this still resolve yes?

predicted YES

@chrisjbillington Also "draw me an image with the correct answer to 5 + 9" fails consistently in bing image creator, if there was an LLM in the loop shouldn't this (or any of the other many tries) work?

predicted NO

@JoaoPedroSantos No, actually you can't. I mean, you can ask it to generate four images, but they'll be identical or near-identical. The variation in ChatGPT's image output as it's currently configured is almost entirely due to the ChatGPT writing a different prompt to DALL-E 3 for each image. DALL-E 3 itself is configured with a fixed random seed and will generate the same image every time, except for occasional very minor differences that might be due to e.g. different floating point rounding on different GPUs, or something like that (similar imperfect determinism has have been seen running OpenAI LLMs at zero temperature).

Despite there being a "seed" parameter in the API call, it is currently ignored by the backend (much to ChatGPT's confusion, since the documentation it has been given about the text2im() function says it should work)



predicted YES

@chrisjbillington So you are basing the analysis that this should not work on a single non representative seed of a model that might be different to the one used in bing image creator?
Also can you provide a good argument for the following.
If bing image creator is using an LLM why do most inputs for sums fail except this specific one?
Having an LLM in the mix should increase entropy and make it so that the answer would be less dependent of the specific wording of the prompt to DALL-E 3

predicted NO

@JoaoPedroSantos

if there was an LLM in the loop shouldn't this (or any of the other many tries) work?

No. We can clearly see with ChatGPT (are you looking at the screenshots I've posted of the prompts ChatGPT is passing to DALL-E?) that the LLM is giving DALL-E 3 the answer and DALL-E 3 is still failing.

DALL-E 3 is poor enough that even when given the exact answer it fails to draw it unless you phrase things in a particular way. What you're doing is prompt engineering to figure out how to get the LLM to phrase things well with whatever prompt it generates for DALL-E 3, when that prompt already includes the answer.

predicted NO

@JoaoPedroSantos

So you are basing the analysis that this should not work on a single non representative seed of a model that might be different to the one used in bing image creator?

No, I've made many comments in this thread and done many tests of many kinds, and the result is a clear picture of what is going on. How many tests of your prompt being passed-through to DALL-E 3 would you find convincing?

predicted NO

@JoaoPedroSantos

If bing image creator is using an LLM why do most inputs for sums fail except this specific one? Having an LLM in the mix should increase entropy and make it so that the answer would be less dependent of the specific wording of the prompt to DALL-E 3

It's not a very smart LLM, you've figured out with the one prompt you've found that works how to coerce it into not fucking up its prompt to DALL-E.

predicted YES

@chrisjbillington through GPT-4 I would say that five or so would provide some insights, but still if the versions of the model from GPT-4 and bing image creator are different (likely) then no amount of tests in GPT-4 would convince me

Given that the prompt needs to be this cherry picked any slight modifications to the model would ruin this.
And I know this is a kind of I only win scenario but there needs to be more information on how microsoft is using DALL-E 3 for me to lean your way, because right now I don't know if it is varying seeds for DALL-E 3 or varying my prompt.

predicted NO

@JoaoPedroSantos

through GPT-4 I would say that five or so would provide some insights

GPT-4 is not involved, it's just passing through the prompt. This question is not about how well GPT-4 can do arithmetic and transform prompts about the answer for DALL-E, it's about whether DALL-E can do arithmetic.

So here's more results with the prompt pass-through technique:

"one plus one, draw the result."

"one plus two, draw the result."

"seven plus nine, draw the result."

"three plus eight, draw the result."

"zero plus fourteen, draw the result"

Lol it didn't even get that last one.

Yes they're all somewhat similar because they are generated from the same random initial image due to the fixed random seed, so all variation comes only from variations in the prompt, which are all pretty similar. We could get different results with a different style of prompt (downthread we got lots of numbers made out of animals)

predicted YES

@chrisjbillington the last one is a one and a four
Given that you are allways getting the same kind of people I think GPT-4 is allways using the same seed, so this is not really a representative sample, I would say that that seed clearly does not work. I don't think we can say DALL-E 3 in GPT is working properly, given that it seems to not be using the model to its full extent (just one seed is a very limited use).
I think your test are showing that DALL-E 3 in bing image generator might be working better.

predicted YES

@JoaoPedroSantos Also example two worked, there is a three there XD

predicted YES

@chrisjbillington Different style of prompt kind of defeats the purpose since this is the only prompt I have found working in bing image generator, right now I think my methodology is less wrong than yours, but both of them have setbacks, I don't know if bing is preserving my prompt you can't change the seed. We need to wait for @firstuserhere to give updates on the methodology he is going to use, although right now in the description it defaults to mine after some time.

predicted NO

@JoaoPedroSantos

the last one is a one and a four

You're being very generous lol. Is the one to the left of the four or the right?

I think GPT-4 is allways using the same seed

This is true beyond question at this point

so this is not really a representative sample

Different styles of prompts substitute for randomness of different seeds, and all have had the same result. "three plus five, draw the result, green vector art":

Do you want to predict what will happen when the seed is no longer fixed in a future release, or API access? The images are going to be the same degree of wrongness. At this point the thread has explored many approaches and it is very unlikely it's just an unlucky seed. I emphasise again that this is a long thread, there has been much discussion, many things have been tried.

Also example two worked, there is a three there XD

Don't know if that counts given it isn't in any way emphasised as the answer, @firstuserhere discussed a bit downthread how he'd handle that sort of thing. In any case I think it was established earlier that DALL-E 3 does seem to know that 1+1=2, so it wouldn't surprise me if it knows 1+2. I am quite confident that doesn't know enough arithmetic to resolve this question YES and the only reason we're seeing any answers at all is because of LLMs feeding DALL-E 3 the answers. Remaining uncertainty about this question is about how the creator will resolve. If he sticks to what he has said about wanting to use OpenAI's version and not allow an LLM to transform prompts, then it's currently kind of impossible due to the lack of a random seed. As far as I am aware, there is not currently a way to generate multiple images with different seeds with DALL-E 3 without an LLM in the mix.

I'll see if I can find more evidence that Bing is using an LLM. It's not very plausible that DALL-E 3 can do what you're seeing it do through Bing, and it can't through ChatGPT, unless it is.

predicted YES

Remember, I have found a single prompt in Bing that works, after some 30-40 or so tries, your LLM explanation in bing seems antagonistic to this long search. But well, lets wait and see.

predicted NO

@JoaoPedroSantos

it defaults to mine after some time.

This is only in the case that

In case a successful strategy isn't found in 1 month for passing the actual prompt to DALLE-3 aka an equation whose correct answer it must generate

And is predicated on

DALLE-3 must actually get the prompt of writing the answer to that small equation

There is a successful strategy for passing the prompt directly to DALL-E 3 in ChatGPT, the remaining problem with which is that we have a fixed seed. There is also reason to believe Bing image creator has an LLM in the loop, which if it does, disqualifies it too (we previously assumed it didn't, but then why would it be better at arithmetic than raw DALL-E 3 through ChatGPT?)

your LLM explanation in bing seems antagonistic to this long search

Not really relevant - if it's true it's true, and if you wasted your time then that's unfortunate.

predicted YES

@chrisjbillington
your LLM explanation in bing seems antagonistic to this long search
If it was true it would work more often no? without confirmation this would be a clue that it isn't right?
Since we do not know we need to take evidence both ways.
Also, my prompt sometimes fails in bing so clearly only some subset of the accessible seeds (assuming the seeds are important) can provoke the model to actually do arithmetics correctly.
It could still be possible that GPT-4 seed simple makes the model evolve into a state where it cannot do arithmetic.

predicted NO

@JoaoPedroSantos no, I wouldn't say so. Remember, with ChatGPT transforming the prompts to include the answer, we still see failure most of the time. DALL-E 3 is just that sensitive that even when given the answer, it only works some of the time. You have found a meta prompt that leads to consistent enough underlying prompts that have it succeeding most of the time when given the answer. Some of the variation in output you're seeing is likely due to different seeds, and some will be due to different underlying prompts - as we can see with chatGPT, the prompts it generates vary.

predicted YES

@chrisjbillington Most off what you are writing is speculatory, I could speculate int the opposite direction. But I understand your argument and find it sound. Its just that good arguments for either side can't figure out the truth.
Right now there are two pieces of information that would be crucial to resolve this discussion.
1 - Get concrete evidence Bing image generator is using an LLM AND that this is not intented behaviour on the core of how DALL-E 3 works.
2 -That GPT-4 with multiple different seeds keeps failing the same way it has been so far.
Until then I say we wait, and I will keep my bet.
Also thanks for the thoughful discussion.

predicted NO

@JoaoPedroSantos Check this one out, this prompt is blocked now you so you can't reproduce it (I'll try some variants to see if I can reproduce it), but someone extracted some of the prompt of the LLM that is transforming your queries to Bing image creator:

"You are a/the prompt modifier for [indecipherable] DALL-E image generation service"

Some more evidence collected here:

https://twitter.com/madebyollin/status/1708204657708077294

This appears to be the consensus view, that there is an LLM.

I got this from this hacker news thread:

https://news.ycombinator.com/item?id=37725498

It seems that to resolve this according to the stated requirements so far, @firstuserhere will need to wait until we can call DALL-E 3 directly without a fixed seed, either because the backend ChatGPT is using no longer imposes a fixed seed, or because we otherwise get API access to DALL-E 3.

bought Ṁ100 of YES

@chrisjbillington I'm not sure if your prompt really does forward the text for DALL-E without modification. Why are there people on the image? There is no mention of people in the prompt

predicted NO

@MrLuke255 ChatGPT shows you what prompt it used, it indeed is passing it through unmodified

There are people just because the images are a bit random and that's what DALL-E does. There are people in all of them because the prompts are so similar and the random seed for DALL-E is the same, so we're seeing a somewhat deterministic sampling of this randomness.

Other prompts experimented with earlier in this comment thread had the numbers always made out of animals, and the same question was asked.

predicted YES

@chrisjbillington Ok, makes sense. But still, there I can't be sure if there is no prompt refinement between ChatGPT and image generation part in DALL-E.

predicted NO

@MrLuke255 I suppose, but I'd be very happy to call whatever is behind the `dalle.text2im()` API call that ChatGPT is making "DALL-E 3" regardless of whether it does anything else. And I would bet it doesn't do any other transformations that aren't already part of models like this, since it seems to behave largely the same way as DALL-E 2 and other AI image generation models, just better.

More related questions