41. Will an image model win Scott Alexander’s bet on compositionality, to Edwin Chen’s satisfaction, in 2023?
resolved Jan 9

See https://www.surgehq.ai/blog/dall-e-vs-imagen-and-evaluating-astral-codex-tens-3000-ai-bet . Scott and Edwin will try to get the top image models of late 2023 to try the specific questions in the bet. If we can’t access the models, then Edwin can use public demos of the image models and his own best guess to resolve this as either likely true, likely false, or unclear. Edwin believes current AI models have not won the bet, so if there is no clear progress he should resolve the bet false. If Edwin is unwilling to judge this, Gary Marcus will be used as the substitute; if neither of these two people will do it, the question resolves as unclear.

This is question #41 in the Astral Codex Ten 2023 Prediction Contest. The contest rules and full list of questions are available here. Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

I tried on my own to get the fox one. Here was my FIRST attempt:


@benshindel My first attempt today passed as well (1/2):

Second attempt also passed (1/2):

Third attempt passed (2/2):

Fourth attempt passed (1/2):

Fifth attempt passed (1/2):

Sixth attempt passed (1/2):

Seventh attempt failed (0/2).

Eighth attempt passed (2/2):

Ninth attempt failed (0/2).

Tenth attempt failed (0/2).

It's not outside the realm of possibility, but it seems to me that the odds of all ten images failing are somewhere in the vicinity of 0.5-1%, unless my implementation is somehow wrong.
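For anyone who wants to sanity-check that kind of estimate, here is a minimal sketch. It assumes the per-attempt tallies listed above are the relevant data and that each image passes independently with the same probability; both are my assumptions, not something the thread confirms, and the function names are made up for illustration. Under those assumptions the tallies give a per-image pass rate of 9/20 and an all-ten-fail probability of roughly 0.25%, while the quoted 0.5-1% range corresponds to a per-image pass rate of roughly 37% to 41%.

```python
# Passing images per attempt (out of 2), copied from the tallies in the thread.
passes_per_attempt = [1, 1, 2, 1, 1, 1, 0, 2, 0, 0]
IMAGES_PER_ATTEMPT = 2  # each attempt produced two images


def per_image_pass_rate(passes, images_per_attempt):
    """Fraction of all generated images that passed."""
    return sum(passes) / (len(passes) * images_per_attempt)


def all_fail_probability(pass_rate, n_images):
    """Chance that n_images all fail, ASSUMING independent images
    that each pass with the same probability pass_rate."""
    return (1.0 - pass_rate) ** n_images


def pass_rate_for_all_fail(target_probability, n_images):
    """Inverse of the above: the per-image pass rate that would make
    the all-fail probability equal to target_probability."""
    return 1.0 - target_probability ** (1.0 / n_images)


rate = per_image_pass_rate(passes_per_attempt, IMAGES_PER_ATTEMPT)
print(f"per-image pass rate: {rate:.2f}")
print(f"P(all 10 fail): {all_fail_probability(rate, 10):.4%}")
for target in (0.005, 0.01):
    implied = pass_rate_for_all_fail(target, 10)
    print(f"all-fail probability {target:.1%} implies pass rate {implied:.1%}")
```

The independence assumption is the weak point: a judge who is strict on one image is probably strict on all of them, so the real odds of a clean sweep of failures are likely higher than this simple model suggests.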

I would say Chen's assessment is fair, as there is still a compositionality issue with generative AI. DALL-E 3 has shown progress but still fails difficult cases, and Midjourney is way behind.

On the one hand, yes, 'are you kidding me.' On the other hand, as I said when I first bet, know your judge, this was always going to be a stickler, and I looked at the post and I agree that technically the test was failed.

@ZviMowshowitz I honestly think the two fox images should have passed. Like, you can always raise some narrow technical complaint along the lines of "the astronaut wasn't quite holding the fox" or "the fox's lipstick looked fake". Imagine if they'd all been acceptable; you could still always get more fine-grained:

-the pixel art had non-square elements so it's not quite pixel art

-the building had machinery but wasn't explicitly a "factory"

-the cathedral was missing gothic architecture so it's just a church

etc, etc... I think it misses the point of the compositionality bet to argue about whether the lipstick was lipstick-y enough, for instance.

…are you kidding me. I accept that this is resolved fairly because the criteria gave Edwin Chen as the judge, but… Edwin Chen is a bad judge. I have run these prompts through DALL-E 3 and it pretty clearly passes the explicit conditions :/

@benshindel in my ACX prediction spreadsheet last year, I wrote this note down next to this question: "is Edwin Chen an asshole?" Clearly he is since he doesn't want to admit he is wrong, but this is the sort of question where this risk is easy to identify, so no sympathy for you!

@benshindel Well, you can see every step of the evaluation, and it is not only Edwin Chen, but 5 evaluators. The images are there, along with why they fail. The only case that is a bit subjective is the farmer and its red basketball. However, this was one of the reasons the prompt was difficult: since most basketballs are orange, it's more difficult for a model to draw one in red. So it is fair to reject the orange case, instead of saying orange is a kind of red.

Resolving NO as per link posted below.

Ugh that’s really annoying :/

📢Results within 1 Week from 1/4/2024 per Scott.


@SirCryptomind whoah. Can we not wait a week? I put hours and hours into this over the last year

@Ernie also isn't chickening out a type of losing? If someone challenges you to a bet then refuses to show up and make their case, they just lose

@Ernie I assume these are the same answers given to Metaculus. If he changes his mind, I am sure he will email me.

@Ernie I would agree with waiting if there's a chance someone here will be able to get in touch with Chen. Does anyone think they can?

Unfortunately, the question really is about what Chen thinks, so I don't see how it can resolve to a non-N/A answer without getting an answer from him. I guess I don't buy that a non-response counts as a YES.

Is there actually a bet at this point between the two?

@SirCryptomind Did @ScottAlexander already check with Gary Marcus? From description:

If Edwin is unwilling to judge this, Gary Marcus will be used as the substitute

@MartinRandall I am simply the messenger here. I emailed him and did what I always do with his answers: I post the email as proof, resolve, and tell Metaculus if needed.

@SirCryptomind should resolve through a post:

Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

I would expect Scott to notice his backup plan while writing a post but perhaps not in an email.

Like I said, since becoming a mod and knowing Scott does not run ACX Bot, all the resolutions have come from him through email, not his ACX posts/email list.

Can't resolve yet!

I'm surprised this market is still at 68%. Do we think this is likely to happen in like the next two weeks, or do I misunderstand the criteria?

Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

We won't know the result until Scott tells us, which will likely be in early 2024. The fact that we haven't heard anything yet is obviously still evidence for NO, but not strong evidence.

@chrisjbillington That's interesting! The deadline currently says Dec 31 2023. But maybe it would be extended past that to see if there's a future post on Astral Codex Ten?

@ChrisPrichard likely, yes. Market close dates aren't necessarily resolution deadlines, and since this pertains to Scott's annual predictions which he doesn't usually follow up on until later, it's likely the close date is just a placeholder and will either be extended or the market remain closed until resolution can be determined.

@chrisjbillington Oh, okay! TIL - thanks. :)

