Will DALL-E 4 be able to generate an image with 10 English words without making any mistakes, at least 50% of the time?
38, Ṁ3573, 2028, 86% chance

When DALL-E 4 is released and becomes available to my personal Plus account, I will do the following:

  1. Go to the front page of The New York Times

  2. Pick 10 different articles (such as this one)

  3. Copy the first 10 words from each article (for example, "When Microsoft announced that it was spending $69 billion to")

  4. Ask DALL-E 4 to generate images with the following prompt: "Create an image of a person holding a sign that says '<10 words>'". I will also try prompts that use double quotes instead of single quotes, as well as prompts that omit the leading "Create an", if that improves the final image quality.

  5. If DALL-E 4 generates 4 images per prompt by default, I will take the 4 images generated by each prompt. If it generates fewer, I will spend additional tokens to generate more images. If it generates more than 4 by default, I will only take the first 4 for each prompt.

  6. Resolution criteria: if out of the 40 generated images at least 20 contain the exact words requested without a single spelling mistake, this question will resolve to Yes. Otherwise it will resolve to No.
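
The procedure above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool the market creator uses: the function names (`first_words`, `build_prompt`, `resolves_yes`) are hypothetical, and the per-image verdicts are assumed to come from a human reading each sign.

```python
def first_words(article_text: str, n: int = 10) -> str:
    """Return the first n whitespace-separated words, punctuation intact."""
    return " ".join(article_text.split()[:n])


def build_prompt(sign_text: str) -> str:
    """Fill the market's prompt template (single-quote variant)."""
    return f"Create an image of a person holding a sign that says '{sign_text}'"


def resolves_yes(verdicts: list[bool], required: int = 20, total: int = 40) -> bool:
    """True if at least `required` of the `total` images were judged exact.

    `verdicts` holds one human judgment per image: 10 prompts x 4 images.
    """
    if len(verdicts) != total:
        raise ValueError(f"expected {total} verdicts, got {len(verdicts)}")
    return sum(verdicts) >= required


# The tail of this article text is a made-up placeholder.
opening = "When Microsoft announced that it was spending $69 billion to (rest of article)"
print(build_prompt(first_words(opening)))

# Exactly 20 exact images out of 40 is enough; 19 is not.
print(resolves_yes([True] * 20 + [False] * 20))
print(resolves_yes([True] * 19 + [False] * 21))
```

Note that the threshold is inclusive: "at least 20" means an even 20/40 split resolves to Yes.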

Additional clarifications:

  1. Punctuation must be preserved exactly as requested. For example, if the prompt is "Jack, Sally, John ..." and the output is "Jack Sally John", then that image will count as a failure (and vice versa).

  2. All special characters (such as the dollar sign) must be preserved as well.

  3. If the first 10 words of an article contain words written in an alphabet other than the English one (e.g. Путин), I will choose a different article instead.

  4. There shouldn't be additional words added to the beginning or the end of the requested text.

  5. If DALL-E refuses to generate an image because it finds the text to be inappropriate for some reason, I will copy the first 10 words from a different article.

  6. If a sequel to DALL-E 3 is not released by Oct 20th 2028, this will resolve as N/A.

  7. If a sequel is released but it gets a different name (e.g. Bing Image Generator), then I will use that sequel for the purposes of this market.

  8. If an intermediate version is released (e.g. DALL-E 3.5) that doesn't pass this test, I will wait until either version 4 or a clearly denoted sequel is released before resolving the market.

  9. If there's an update to DALL-E 3 or an intermediary version before version 4 that successfully passes this test, I will resolve the market early.

  10. If the New York Times shuts down by the time DALL-E 4 is released, I will use another major newspaper for the source of prompts.

  11. The prompt will be inserted exactly as described, with no context, no custom instructions and a fresh chat window. No additional instructions will be provided even if DALL-E 4 tries to modify the prompt before generating the images.

  12. If there's a caching mechanism involved and we can't get 4 unique images per prompt, I will generate more images based off additional NYT articles until we have 40 unique images to evaluate.

  13. I'll use the regular ChatGPT interface with the DALL-E 4 mode enabled to evaluate the results, even if ChatGPT injects its own modifications into the process.

  14. If ChatGPT shuts down in favor of Bing Chat or some other product, I'll use that interface instead.

  15. I will use the "you know it when you see it" principle when evaluating ambiguous edge cases of what constitutes a "mistake".
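
Clarifications 1 to 4 amount to a strict string comparison. A minimal sketch follows, assuming a human has already transcribed the text on the sign; the helper names are hypothetical. Two choices here are assumptions that the rules above do not settle: line breaks on the sign are treated as ordinary spaces, and letter case is compared strictly unless `ignore_case=True` is passed. Genuinely ambiguous cases still fall under the "you know it when you see it" rule.

```python
def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


def is_exact_match(requested: str, transcribed: str, ignore_case: bool = False) -> bool:
    """Compare sign text to the requested text, keeping punctuation and symbols.

    Any missing comma, dropped "$", or extra leading/trailing word fails.
    """
    a, b = normalize_spaces(requested), normalize_spaces(transcribed)
    if ignore_case:  # assumption: the market does not say whether case matters
        a, b = a.lower(), b.lower()
    return a == b


def uses_english_alphabet(text: str) -> bool:
    """Rough check for clarification 3: every letter must be plain ASCII."""
    return all(ch.isascii() for ch in text if ch.isalpha())


print(is_exact_match("Jack, Sally, John", "Jack Sally John"))     # punctuation dropped
print(is_exact_match("Jack, Sally, John", "Jack,\nSally, John"))  # only a line break differs
print(uses_english_alphabet("Путин"))
```

The alphabet check is deliberately coarse: it would also reject accented Latin letters, which the clarification does not explicitly address.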

For reference, this is what DALL-E 3 (via Bing) generates as of today:


Wow, FLUX.1 [dev] passed the test with flying colors, 3/4 images generated were valid, the other 2 were very close. This market is about DALL-E 4 so I won't run the full 40 images test, but this is very impressive regardless.

@traders big news! GPT-4o (with DALL-E 3 in the back) was able to generate a valid image for the very first time for: Create an image of a person holding a sign that says 'When Microsoft announced that it was spending $69 billion to'. Success rate was just 1/10, but this is still a major breakthrough as I was never able to get a perfect image generated on prior attempts.

Valid result:

Invalid results:

If I ask it for text alone it's very close but still won't pass the 50% test

GPT-4o is getting close but still no cigar

Can't try it myself yet but Imagen 3 claims to have made significant progress on this problem:

Meta.AI didn't succeed either

DALL E 2 FREE 2024??

predicts YES

Midjourney 6.0 was about as good as DALL-E 4 for the sample prompt

predicts YES

Imagen-2 attempts for this prompt. Still far from perfection: https://imgur.com/a/l9nN31k

Nope, not without spelling mistakes. Just tested.

@DavidBolin Oops, did not notice this was about the next version :)
