When DALL-E 4 is released and becomes available to my personal Plus account, I will do the following:
Go to the front page of the New York Times
Pick 10 different articles (such as this one)
Copy the first 10 words from each article (for example, "When Microsoft announced that it was spending $69 billion to")
Ask DALL-E 4 to generate images with the following prompt: "Create an image of a person holding a sign that says '<10 words>'". I will also try prompts using double quotes instead of single quotes, as well as prompts skipping the "Create an", if that improves the final image quality.
If DALL-E 4 generates 4 images per prompt by default, I will take the 4 images generated by each prompt. If it generates fewer, I will spend additional tokens to generate more images. If it generates more than 4 by default, I will only take the first 4 for each prompt.
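The prompt-building steps above can be sketched in Python. This is only an illustration of the procedure, not a tool I will necessarily use: the function names (first_ten_words, build_prompt, take_first_four) are hypothetical, and treating a "word" as any whitespace-separated token is an assumption.

```python
PROMPT_TEMPLATE = "Create an image of a person holding a sign that says '{text}'"
WORDS_PER_PROMPT = 10
IMAGES_PER_PROMPT = 4


def first_ten_words(article_text: str) -> str:
    """Return the first 10 whitespace-separated tokens of an article.

    Assumption: a "word" is any run of non-whitespace characters, so
    "$69" counts as one word and keeps its special character.
    """
    words = article_text.split()
    if len(words) < WORDS_PER_PROMPT:
        raise ValueError("article has fewer than 10 words")
    return " ".join(words[:WORDS_PER_PROMPT])


def build_prompt(article_text: str) -> str:
    """Insert the first 10 words into the single-quote prompt variant."""
    return PROMPT_TEMPLATE.format(text=first_ten_words(article_text))


def take_first_four(images: list) -> list:
    """Keep only the first 4 images returned for a prompt."""
    return images[:IMAGES_PER_PROMPT]


if __name__ == "__main__":
    opening = (
        "When Microsoft announced that it was spending $69 billion to "
        "acquire a game publisher"  # made-up continuation, for illustration only
    )
    print(build_prompt(opening))
```

With 10 articles and 4 images kept per prompt, this gives the 40 images that are evaluated.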
Resolution criteria: if out of the 40 generated images at least 20 contain the exact words requested without a single spelling mistake, this question will resolve to Yes. Otherwise it will resolve to No.
Additional clarifications:
Punctuation must be preserved exactly as requested. E.g. if the prompt is "Jack, Sally, John ..." and the output is "Jack Sally John", then that image will count as a failure (and vice versa).
All special characters (such as the dollar sign) must be preserved as well.
If the first 10 words of an article contain words written in an alphabet other than English (e.g. Путин), I will choose a different article instead.
There shouldn't be additional words added to the beginning or the end of the requested text.
If DALL-E refuses to generate an image because it finds the text to be inappropriate for some reason, I will copy the first 10 words from a different article.
If a sequel to DALL-E 3 is not released by Oct 20th 2028, this will resolve as N/A.
If a sequel is released but it gets a different name (e.g. Bing Image Generator), then I will use that sequel for the purposes of this market.
If an intermediate version is released (e.g. DALL-E 3.5) that doesn't pass this test, I will wait until either version 4 or a clearly denoted sequel is released before resolving the market.
If there's an update to DALL-E 3 or an intermediary version before version 4 that successfully passes this test, I will resolve the market early.
If the New York Times shuts down by the time DALL-E 4 is released, I will use another major newspaper for the source of prompts.
The prompt will be inserted exactly as described, in a fresh chat window with no context and no custom instructions. No additional instructions will be provided even if DALL-E 4 tries to modify the prompt before generating the images.
If there's a caching mechanism involved and we can't get 4 unique images per prompt, I will generate more images based off additional NYT articles until we have 40 unique images to evaluate.
I'll use the regular ChatGPT interface with the DALL-E 4 mode enabled to evaluate the results, even if ChatGPT injects its own modifications into the process.
If ChatGPT shuts down in favor of Bing Chat or some other product, I'll use that interface instead.
I will use the "you know it when you see it" principle when evaluating ambiguous edge cases of what constitutes a "mistake".
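The resolution rule can be written down as a small Python check. This is a rough sketch, not the actual evaluation process: the text on each sign is read by eye, so the second element of each pair is assumed to be my manual transcription, the function names are hypothetical, and collapsing whitespace (because sign text usually wraps across lines) is an assumption. The "you know it when you see it" cases are not captured here.

```python
REQUIRED_IMAGES = 40  # 10 prompts x 4 images
PASS_THRESHOLD = 20   # at least 20 exact images resolves Yes


def is_exact_match(requested: str, transcribed: str) -> bool:
    """True only if the sign shows exactly the requested text.

    Case, punctuation and special characters must match, and no extra
    words are allowed. Assumption: only whitespace is normalized, since
    line breaks on a sign are not spelling mistakes.
    """
    return " ".join(requested.split()) == " ".join(transcribed.split())


def resolves_yes(pairs: list[tuple[str, str]]) -> bool:
    """pairs holds (requested_text, transcribed_sign_text) for each image."""
    if len(pairs) != REQUIRED_IMAGES:
        raise ValueError(f"expected {REQUIRED_IMAGES} images, got {len(pairs)}")
    passes = sum(is_exact_match(req, seen) for req, seen in pairs)
    return passes >= PASS_THRESHOLD


if __name__ == "__main__":
    # Dropped commas count as a failure.
    print(is_exact_match("Jack, Sally, John", "Jack Sally John"))
```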
For reference, this is what DALL-E 3 (via Bing) generates as of today:
@traders big news! GPT-4o (with DALL-E 3 in the back) was able to generate a valid image for the very first time for: Create an image of a person holding a sign that says 'When Microsoft announced that it was spending $69 billion to'. Success rate was just 1/10, but this is still a major breakthrough as I was never able to get a perfect image generated on prior attempts.
Valid result:
Invalid results:
Update from SD3, still not perfect: https://manifold.markets/DanMan314/will-stable-diffusion-3-be-consiste#.
Imagen-2 attempts for this prompt. Still far from perfection: https://imgur.com/a/l9nN31k