Gary Marcus wrote a post discussing the inability of the Imagen and DALL-E 2 models to fully grasp language, particularly the relational understanding of objects in a prompt: https://garymarcus.substack.com/p/horse-rides-astronaut
Stability AI just released Stable Diffusion 3 (https://stability.ai/news/stable-diffusion-3), which they claim has "greatly improved performance in multi-subject prompts, image quality, and spelling abilities".
Once the model is publicly available, I will run this prompt from the DeWeese lab, which is discussed at length in the post:
A red conical block on top of a grey cubic block on top of a blue cylindrical block, with a green cubic block nearby
I will produce 10 images. If 5 or more of the images match the prompt exactly, following the colors, shapes, and positions specified in the prompt, this market resolves YES. Otherwise, it resolves NO.
I will not bet in this market in case there is ambiguity on some of the images.
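The resolution rule above can be sketched as a small function. This is only an illustration of the stated criterion (10 images, 5 or more exact matches for YES); the names `resolve`, `TOTAL_IMAGES`, and `PASS_THRESHOLD` are hypothetical, and judging whether an individual image matches the prompt is still done by eye.

```python
from typing import Optional

TOTAL_IMAGES = 10   # number of images generated, per the description
PASS_THRESHOLD = 5  # "5 or more" exact matches resolves YES


def resolve(passes: int, fails: int) -> Optional[str]:
    """Return "YES" or "NO" once the outcome is settled, else None."""
    if passes < 0 or fails < 0 or passes + fails > TOTAL_IMAGES:
        raise ValueError("counts must be non-negative and sum to at most 10")
    if passes >= PASS_THRESHOLD:
        return "YES"
    # Images not yet judged could all still pass in the best case.
    remaining = TOTAL_IMAGES - passes - fails
    if passes + remaining < PASS_THRESHOLD:
        return "NO"
    return None  # not yet determined
```

For example, `resolve(0, 6)` returns `"NO"`, because at most four of the ten images could still match, while `resolve(4, 5)` returns `None` since the last image would decide the outcome.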
Image 1 (fail - noncubic green block, blue block not cylinder/not on bottom)
Image 2 (fail - noncubic green block, blue block not cylinder)
Image 3 (fail - noncubic green block)
Image 4 (fail - so close! grey block not on top of blue cylinder)
Image 5 (obviously fail)
Image 6 (obviously fail)