Will DALL-E 4 be able to generate an image with 10 English words without making any mistakes, at least 50% of the time?


Resolved YES on Apr 25


When DALL-E 4 is released and becomes available to my personal Plus account, I will do the following:

Go to the front page of the New York Times
Pick 10 different articles (such as this one)
Copy the first 10 words from each article (for example, "When Microsoft announced that it was spending $69 billion to")
Ask DALL-E 4 to generate images with the following prompt: "Create an image of a person holding a sign that says '<10 words>'". I will also try prompts using double quotes instead of single quotes, as well as prompts skipping the "Create an", if that improves the final image quality.
If DALL-E 4 generates 4 images per prompt by default, I will take the 4 images generated by each prompt. If it generates fewer, I will spend additional tokens to generate more images. If it generates more than 4 by default, I will only take the first 4 for each prompt.
Resolution criteria: if out of the 40 generated images at least 20 contain the exact words requested without a single spelling mistake, this question will resolve to Yes. Otherwise it will resolve to No.
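The steps and resolution rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the market creator's actual procedure: the names `build_prompt` and `resolve` are hypothetical, and the per-image verdicts are assumed to come from a human reading each sign.

```python
# Constants taken from the market description: 10 prompts x 4 images = 40 images,
# and at least 20 of them must be correct for a YES resolution.
PROMPT_COUNT = 10
IMAGES_PER_PROMPT = 4
PASS_THRESHOLD = 20


def build_prompt(first_ten_words: str) -> str:
    """Build the single-quote variant of the prompt described in the market."""
    return f"Create an image of a person holding a sign that says '{first_ten_words}'"


def resolve(image_is_correct: list[bool]) -> str:
    """Return "YES" if at least 20 of the 40 judged images are correct, else "NO".

    Each entry is a human verdict for one image (hypothetical input format).
    """
    expected = PROMPT_COUNT * IMAGES_PER_PROMPT
    if len(image_is_correct) != expected:
        raise ValueError(f"expected {expected} verdicts, got {len(image_is_correct)}")
    return "YES" if sum(image_is_correct) >= PASS_THRESHOLD else "NO"
```

Under this reading, exactly 20 correct images out of 40 is enough for Yes, and 19 is not.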

Additional clarifications:

Punctuation must be preserved exactly as requested. E.g., if the prompt is "Jack, Sally, John ..." and the output is "Jack Sally John", then that image will count as a failure (and vice versa).
All special characters (such as the dollar sign) must be preserved as well
If the first 10 words of an article contain words written in an alphabet other than the English one (e.g. Путин), I will choose a different article instead.
There shouldn't be additional words added to the beginning or the end of the requested text.
If DALL-E refuses to generate an image because it finds the text to be inappropriate for some reason, I will copy the first 10 words from a different article.
If a sequel to DALL-E 3 is not released by Oct 20th 2028, this will resolve as N/A
If a sequel is released but it gets a different name (e.g. Bing Image Generator), then I will use that sequel for the purposes of this market.
If an intermediate version is released (e.g. DALL-E 3.5) that doesn't pass this test, I will wait until either version 4 or a clearly denoted sequel is released before resolving the market.
If there's an update to DALL-E 3 or an intermediary version before version 4 that successfully passes this test, I will resolve the market early.
If the New York Times shuts down by the time DALL-E 4 is released, I will use another major newspaper for the source of prompts.
The prompt will be inserted exactly as described, with no context and no custom instructions, in a fresh chat window. No additional instructions will be provided even if DALL-E 4 tries to modify the prompt before generating the images.
If there's a caching mechanism involved and we can't get 4 unique images per prompt, I will generate more images based off additional NYT articles until we have 40 unique images to evaluate.
I'll use the regular ChatGPT interface with the DALL-E 4 mode enabled to evaluate the results, even if ChatGPT injects its own modifications into the process.
If ChatGPT shuts down in favor of Bing Chat or some other product, I'll use that interface instead.
I will use the "you know it when you see it" principle when evaluating ambiguous edge cases of what constitutes a "mistake".
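For illustration, the text-related clarifications above could be checked roughly as follows. This is a sketch under stated assumptions, not how the market was judged: all three function names are hypothetical, the text on the sign is assumed to have been transcribed by hand, line wrapping on the sign is assumed not to count as a mistake, and the alphabet check is a crude proxy that also rejects accented Latin letters.

```python
def first_ten_words(article_text: str) -> str:
    """Return the first 10 whitespace-separated words of an article."""
    return " ".join(article_text.split()[:10])


def letters_are_english(text: str) -> bool:
    """Rough proxy for the alphabet rule: every letter must be an ASCII letter.

    Punctuation and symbols (such as "$" or curly quotes) are not rejected here.
    """
    return all(ch.isascii() for ch in text if ch.isalpha())


def is_exact_match(requested: str, transcribed: str) -> bool:
    """Compare word by word, keeping case, punctuation and special characters.

    Only whitespace and line breaks are ignored (an assumption), so missing
    commas, misspellings, or extra leading/trailing words all count as failures.
    """
    return transcribed.split() == requested.split()
```

The ambiguous cases mentioned in the last clarification would still need a human call. A strict comparison like this only covers the clear-cut ones.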

For reference, this is what DALL-E 3 (via Bing) generates as of today:





Didn't have time to do the 40 prompts but GPT-4o gives me no reason to doubt, so resolving to Yes to avoid holding this market 'hostage'.

Market is now closed as GPT-4o is arguably DALL-E 4, thanks to the name having a "4" in it and OpenAI making a (relatively) big deal out of the announcement

GPT-4o Image Generation nailed it in my limited run. I'll do a full run of 10 images with up to 4 attempts per image, but so far I think this will be a Yes. I'll post the full image collection when done.

@nsokolsky It's not DALL-E. The system card is very clear about this:

Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.

The argument from having a "4" in it doesn't seem coherent. The DALL-E N line of models is numbered independently of the GPT-N(o) line of models. They ended up with similar numbering because of when models were released, but this would be like comparing Gemini 2.5, Claude 3.7, and GPT-4.5 based on their numbers.

@Jacy an extra rule said that an intermediary version passing the test will resolve to Yes. It's basically a Yes, just been putting off doing the full 40 lol

Wow, FLUX.1 [dev] passed the test with flying colors, 3/4 images generated were valid, the other 2 were very close. This market is about DALL-E 4 so I won't run the full 40 images test, but this is very impressive regardless.

@traders big news! GPT-4o (with DALL-E 3 in the back) was able to generate a valid image for the very first time for: Create an image of a person holding a sign that says 'When Microsoft announced that it was spending $69 billion to'. Success rate was just 1/10, but this is still a major breakthrough as I was never able to get a perfect image generated on prior attempts.

Valid result: