41. Will an image model win Scott Alexander’s bet on compositionality, to Edwin Chen’s satisfaction, in 2023?


resolved Jan 9


See https://www.surgehq.ai/blog/dall-e-vs-imagen-and-evaluating-astral-codex-tens-3000-ai-bet . Scott and Edwin will try to get the top image models of late 2023 to try the specific questions in the bet. If we can’t access the models, then Edwin can use public demos of the image models and his own best guess to resolve this as either likely true, likely false, or unclear. Edwin believes current AI models have not won the bet, so if there is no clear progress he should resolve the bet false. If Edwin is unwilling to judge this, Gary Marcus will be used as the substitute; if neither of these two people will do it, the question resolves as unclear.

This is question #41 in the Astral Codex Ten 2023 Prediction Contest. The contest rules and full list of questions are available here. Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.





predictedYES

I tried on my own to get the fox one. Here was my FIRST attempt:

passes

predictedYES

@benshindel My first attempt today passed as well (1/2):

Second attempt also passed (1/2):

Third attempt passed (2/2):

Fourth attempt passed (1/2):

Fifth attempt passed (1/2):

Sixth attempt passed (1/2):

Seventh attempt failed (0/2).

Eighth attempt passed (2/2):

Ninth attempt failed (0/2).

Tenth attempt failed (0/2).

It's not outside the realm of possibility, but it seems to me that the odds that all ten images fail are somewhere in the vicinity of 0.5-1%, unless my implementation is somehow wrong.
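The estimate above can be sanity-checked with a rough back-of-the-envelope sketch. It assumes each attempt produced two images (read off the "(k/2)" labels), that images pass or fail independently, and that the observed pass rate is representative; the variable and function names are illustrative only.

```python
# Pass counts per attempt as reported above.
passes_per_attempt = [1, 1, 2, 1, 1, 1, 0, 2, 0, 0]
IMAGES_PER_ATTEMPT = 2  # assumption: "(k/2)" means k of 2 images passed


def prob_all_fail(pass_rate: float, n_images: int = 10) -> float:
    """Chance that n independent images all fail, given a per-image pass rate."""
    return (1.0 - pass_rate) ** n_images


total_images = IMAGES_PER_ATTEMPT * len(passes_per_attempt)
pass_rate = sum(passes_per_attempt) / total_images

print(f"per-image pass rate: {pass_rate:.2f}")
print(f"P(all 10 images fail): {prob_all_fail(pass_rate):.2%}")
```

With these counts the per-image pass rate is 0.45 and the chance of ten straight failures comes out near 0.25%. A slightly lower assumed pass rate of 0.40 gives about 0.6%, which is inside the range quoted above.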

predictedNO

I would say Chen's assessment is fair, as there is still a compositionality issue with generative AI. DALL-E 3 has shown progress but is still failing difficult cases; Midjourney is way behind.

predictedNO

On the one hand, yes, 'are you kidding me.' On the other hand, as I said when I first bet, know your judge, this was always going to be a stickler, and I looked at the post and I agree that technically the test was failed.

predictedYES

@ZviMowshowitz I honestly think the two fox images should have passed. Like, you can always raise some narrow technical complaint along the lines of "the astronaut wasn't quite holding the fox" or "the fox's lipstick looked fake". Imagine if they'd all been acceptable, you can always get more fine-grained:
- the pixel art had non-square elements so it's not quite pixel art
- the building had machinery but wasn't explicitly a "factory"
- the cathedral was missing gothic architecture so it's just a church

etc, etc... I think it misses the point of the compositionality bet to argue about whether the lipstick was lipstick-y enough, for instance.

predictedYES

…are you kidding me. I accept that this is resolved fairly because the criteria gave Edwin Chen as the judge, but… Edwin Chen is a bad judge. I have run these prompts through DALL-E 3 and it pretty clearly passes the explicit conditions :/

predictedYES

@benshindel in my ACX prediction spreadsheet last year, I wrote this note down next to this question: "is Edwin Chen an asshole?" Clearly he is since he doesn't want to admit he is wrong, but this is the sort of question where this risk is easy to identify, so no sympathy for you!

predictedNO

@benshindel Well, you can see every step of the evaluation, and it is not only Edwin Chen but 5 evaluators. The images are there, along with why they fail. The only case that is a bit subjective is the farmer and its red basketball. However, this was one of the reasons the prompt was difficult: as most basketballs are orange, it's more difficult for a model to draw one in red. So it is fair to reject the orange case, instead of saying orange is a kind of red.

predictedYES

Resolving NO as per link posted below.

Failure!

https://www.surgehq.ai/blog/dalle-3-and-midjourney-fail-astral-codex-tens-image-generation-bet

@traders

predictedYES

Ugh that’s really annoying :/

📢Results within 1 Week from 1/4/2024 per Scott.


predictedYES

@SirCryptomind whoah. Can we not wait a week? I put hours and hours into this over the last year

predictedYES

@Ernie also isn't chickening out a type of losing? If someone challenges you to a bet then refuses to show up and make their case, they just lose

@Ernie I assume these are the same answers given to Metaculus. If he changes his mind, I am sure he will email me.

predictedYES

@Ernie I would agree with waiting if there's a chance someone here will be able to get in touch with Chen. Does anyone think they can?

Unfortunately, the question really is about what Chen thinks, so I don't see how it can resolve to a non-N/A answer without getting an answer from him. I guess I don't buy that a non-response counts as a YES.

Is there actually a bet at this point between the two?

predictedNO

@SirCryptomind Did @ScottAlexander already check with Gary Marcus? From description:

If Edwin is unwilling to judge this, Gary Marcus will be used as the substitute

@MartinRandall I am simply the messenger here. I emailed him and did what I always do with his answers: I post the email as proof, resolve, and tell Metaculus if needed.

predictedNO

@SirCryptomind should resolve through a post:

Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

I would expect Scott to notice his backup plan while writing a post but perhaps not in an email.

Like I said, since becoming a mod and knowing Scott does not run ACX Bot, all the resolutions have come from him through email, not his ACX posts/email list.

https://discord.com/channels/915138780216823849/938171760237477998/1192467198611030116

Can't resolve yet!

predictedYES

I'm surprised this market is still at 68%. Do we think this is likely to happen in like the next two weeks, or do I misunderstand the criteria?

predictedYES

@ChrisPrichard

Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

We won't know the result until Scott tells us, which will likely be in early 2024. The fact that we haven't heard anything yet is obviously still evidence for NO, but not strong evidence.

@chrisjbillington That's interesting! The deadline currently says Dec 31 2023. But maybe it would be extended past that to see if there's a future post on Astral Codex Ten?

predictedYES

@ChrisPrichard likely, yes. Market close dates aren't necessarily resolution deadlines, and since this pertains to Scott's annual predictions, which he doesn't usually follow up on until later, it's likely the close date is just a placeholder and will either be extended or the market will remain closed until resolution can be determined.