Will DALL-E 3 correctly respond to prompt 1 from the Scott Aaronson/Gary Marcus/Ernest Davis paper?
Standard
66
Ṁ8583
resolved Jan 1
Resolved
NO

This paper. Prompt 1 is:

a red basketball with flowers on it, in front of blue one with a similar pattern

At least half of the generated images must be correct. I'll only try it once.

predicted YES

FYI I just got 9 out of 10. I'm surprised how much DALL-E 3 seems to fluctuate week to week. I expected either a steady increase or a steady decrease over time.

predicted NO

@Jacy lol I got arguably one out of ten yesterday. Are you passing the prompt verbatim or is ChatGPT writing the DALL-E 3 prompt?

I tried both a bunch yesterday and maybe saw one correct, if I'm generous.

predicted YES

@chrisjbillington I think the criteria pretty clearly imply that DALL-E 3 should receive the prompt, not ChatGPT, so I've been doing that. I've tried giving it to ChatGPT a little, but I'm actually not even sure if that's better or worse.

This is basically the average image I get. Sometimes it fails by having no pattern on the blue basketball. Sometimes it is ambiguous because the back basketball has blue and red non-floral surfaces.

predicted NO

@Jacy very nice. This shouldn't matter, but are you using the specific DALL-E 3 mode, or regular, multimodal, ChatGPT? i.e. which of these two?

I'd be surprised if it affects the actual images, but I'm at least seeing that DALL-E 3 mode can generate multiple images by default, whereas ChatGPT defaults to only one (though I'm sure you could prompt-engineer either into generating however many you want, up to the four I think the DALL-E 3 API call actually supports).

predicted YES

@chrisjbillington I'm using the first. Is the DALL-E in that menu just a "GPT" of the sort OpenAI has been advertising, so it's just a ChatGPT with text prepended that makes it always call DALL-E? I ask because it's in the "Create a GPT" menu.

The multiple images thing is a good point. I haven't tried making ChatGPT call DALL-E multiple times, though it was doing two by default a few weeks ago. As far as I know, all of these DALL-E calls work the same way (e.g., have the same likelihood of success with this prompt), but I haven't tested.

predicted NO

@Jacy

so it's just a ChatGPT with text prepended that makes it always call DALL-E?

Yes, though actually it's ChatGPT that has more info in its system prompt than DALL-E mode. DALL-E is a "custom GPT", and its system prompt says so:

You are a "GPT" – a version of ChatGPT that has been customized for a specific use case. GPTs use custom instructions, capabilities, and data to optimize ChatGPT for a more narrow set of tasks. You yourself are a GPT created by a user, and your name is DALL·E. Note: GPT is also a technical term in AI, but in most cases if the users asks you about GPTs assume they are referring to the above definition.

But its custom instructions are blank! I guess it just infers that it's supposed to generate images more often because dalle is the only tool it has listed as available in the tools section of its system prompt, whereas ChatGPT has dalle, browser, and python.

Here are the full system prompts if you're interested:

https://gist.github.com/chrisjbillington/80e20181787e48720a815aee876d5668

Looks like DALL-E mode has been told to generate only two images, even if the user requests more, and ChatGPT mode to generate only one image even if the user requests more. When I asked it (in DALL-E mode), it seemed confused about this point - maybe there was some info in its training data suggesting four images was possible. Anyway, it looks like the limit for now is one in ChatGPT mode and two in DALL-E mode. Possibly the OpenAI API is different still.

@Jacy How do I get access to pure DALL-E 3, without the ChatGPT wrapper? I don't see an option for that.

predicted NO

@IsaacKing There's no access that doesn't go through ChatGPT, not even via the DALL-E 3 API. But you can write things like "pass the following prompt directly to DALL-E 3, do not modify it in any way" and verify that it complied by clicking on the image and viewing the prompt it used, in the information pane.

predicted YES

@IsaacKing glad you got it figured out. If you want to share how many images you plan to test, that's the big uncertainty I still have about pricing these markets.

@Jacy Just one image, since that's how many a single call generates. Looks like it's pretty easy to get ChatGPT to pass a prompt verbatim to DALL-E, so I'll do that.

@IsaacKing Ah, that's a lot of noise! Thanks.

predicted YES

Tested today on mobile, and it works, both on DALL-E directly (the app still has the option) and when sending the prompt verbatim to ChatGPT: “a red basketball with flowers on it, in front of blue one with a similar pattern”

Can you clarify what you will consider to be a correct answer? The prompt isn’t very clear to me: is the red basketball supposed to have actual flowers on it, and the blue one just a pattern printed on it, or both basketballs are supposed to have a flower pattern printed on them?

@SantiagoRomeroBrufau Any valid interpretation of the prompt is fine.

@IsaacKing whilst you're here, are LLM embellishments of the prompt allowed, or will you ask chatgpt to pass the exact prompt to DALL-E 3 unmodified (and view the prompt it used to verify whether it complied)?

@chrisjbillington Hmm, I'm not sure. What do you think would make the most sense?

predicted NO

@IsaacKing I don't think it matters in practice, since unlike e.g. the "can DALL-E 3 do arithmetic" question, there isn't really anything ChatGPT can do to let DALL-E "cheat" for this one. But a literal reading of the question suggests it's DALL-E 3 that is being tested, which supports passing the prompt verbatim.

predicted YES

@chrisjbillington Yes, I don’t think embellishments should be allowed, as they can result in changes in either direction. Prompt 1 from the paper is the prompt, verbatim, not “a similar-ish prompt that explains this idea after passing through another model or through a person’s interpretation.”

@IsaacKing How many tests will you run? Each time you pass a prompt explicitly to DALL-E through ChatGPT, it returns only one image (not the same image every time FYI).

predicted NO

@Jacy

I'll only try it once.

@chrisjbillington

> "At least half of the generated images must be correct."

predicted NO

@Jacy Yes, he'll prompt once, and out of however many images are generated (used to be four, is currently one), at least half must be correct.

predicted YES

@chrisjbillington shrug That's one take. I think it's ambiguous because the expectation was that one prompt would generate multiple images. So is it one try of generating a few images, or one try of entering the prompt into the system? Producing a few images seems more useful to me because it makes the market less noisy with minimal cost, and AI capabilities seem to be the natural target, rather than whatever number of images OpenAI happened to think is the best balance of user experience and financial cost.

predicted NO

@Jacy At the time this question was written, it wasn't known how many images we would get, so the requirements had to be compatible with any number. If Isaac had written "of the four images, at least two must be correct", then we'd have a problem. But Isaac leans toward literalism and more than half of one is one, so there isn't a problem. Given we didn't know how many images would be generated, from the start this question was a bet on that as much as anything else.

One remaining problem however is whether Isaac will test with the default ChatGPT session, or the DALL-E 3-specific session. The latter may generate more images than the former (two, in the test I just did).

I am also happy to provide a prompt that I think will cause ChatGPT to call DALL-E 3 and generate four images (this is still a single API call from ChatGPT's perspective). This would be compatible with the resolution criteria IMHO and if Isaac agrees to it, so be it. (I'll test that it still works when I'm at my computer).

Not talking my book here. I just think the criteria are unambiguous about a single prompting, and a reduction in the default number of images isn't enough to change that, even if you're not a literalist, given that the criteria were written without knowing it would be four rather than one in the first place.
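
To make the arithmetic concrete, here is a minimal Python sketch of the "at least half" rule as I read it. The function name `resolves_yes` is made up for illustration, and this is only my literal interpretation of the criteria, not something Isaac has confirmed.

```python
def resolves_yes(correct: int, generated: int) -> bool:
    """Return True if at least half of the generated images are correct.

    Hypothetical helper: a literal reading of "At least half of the
    generated images must be correct", applied to a single prompting.
    """
    if generated <= 0:
        raise ValueError("at least one image must be generated")
    if not 0 <= correct <= generated:
        raise ValueError("correct must be between 0 and generated")
    # Compare with integers to avoid floating-point division.
    return 2 * correct >= generated


if __name__ == "__main__":
    # (correct, generated) pairs for the image counts discussed in this thread.
    for correct, generated in [(1, 1), (0, 1), (1, 2), (2, 4), (1, 4)]:
        print(correct, generated, resolves_yes(correct, generated))
```

With a single image, the rule collapses to "that one image must be correct", which is why the number of images per call matters so much for pricing.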

predicted YES

@chrisjbillington your perspective makes sense. Thanks for sharing.

@Jacy So I tested in the DALL-E 3-specific ChatGPT session a bit more, and these things appear to be true at the moment:

  • It is told in its system prompt to default to generating two images

  • The API docs it has in its system prompt indicate the DALL-E 3 text2im function supports up to four images (edit: this is false, see more recent comment chain above).

  • It will happily call that function with n=4 if you ask it to make four images

  • Nonetheless only two images are returned.

But you can imagine that last bit might be a temporary limitation; these things seem to change based on demand.

There's also the option of calling the DALL-E 3 text2im function via the OpenAI API, through which you could prompt DALL-E 3 directly without going through ChatGPT (edit: this is false, ChatGPT still gets a chance to transform your prompts even if you use the API). Perhaps it generates four images by default, or at least might do it if you set n=4.