Note: this is about image-generating AIs, not AIs that describe what is happening in an existing image.
With every new DALL-E and Midjourney version, I have tried the following prompt:
"A red sports car with an old lady driving it and eating a live octopus, with a blue footed booby eating a hamburger in the passenger seat."
So far, every version has failed to get all of the details right. They mix up who is eating the octopus vs. who is eating the hamburger, who is driving vs. who is in the passenger seat, who is eating vs. who is getting eaten, etc.
This resolves YES if an image-generating AI can be given 10 random variations of this prompt consecutively (substituting random variants of the vehicle, person, animals, seat, etc.) and puts all subjects in the correct places doing the correct actions all 10 times.
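
As a rough sketch of how the 10 variations could be drawn: the word pools, the template, and the function names below are my own placeholders, not part of the criteria, and judging whether each image is correct would still be done by a person looking at it.

```python
import random

# Hypothetical substitution pools; the question itself does not fix these lists.
# Vehicles all start with a consonant so the leading "A" in the template stays grammatical.
VEHICLES = ["red sports car", "yellow school bus", "green pickup truck", "white golf cart"]
PEOPLE = ["an old lady", "a young boy", "a clown", "a firefighter"]
ANIMALS = ["a blue footed booby", "a capybara", "an iguana", "a flamingo"]
FOODS = ["a live octopus", "a hamburger", "a slice of watermelon", "a bowl of ramen"]

TEMPLATE = (
    "A {vehicle} with {driver} driving it and eating {driver_food}, "
    "with {passenger} eating {passenger_food} in the passenger seat."
)


def random_prompt(rng):
    """Build one variation of the prompt with randomly chosen subjects."""
    # Two different foods, so "who eats what" can actually be checked.
    driver_food, passenger_food = rng.sample(FOODS, 2)
    # One person and one animal, randomly assigned to driver and passenger.
    occupants = [rng.choice(PEOPLE), rng.choice(ANIMALS)]
    rng.shuffle(occupants)
    driver, passenger = occupants
    return TEMPLATE.format(
        vehicle=rng.choice(VEHICLES),
        driver=driver,
        driver_food=driver_food,
        passenger=passenger,
        passenger_food=passenger_food,
    )


def make_trial(n=10, seed=None):
    """Return n random prompt variations; a seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [random_prompt(rng) for _ in range(n)]


def resolves_yes(judgments):
    """judgments: one True/False per image, assigned by a human judge.

    YES requires exactly 10 consecutive prompts, all judged correct.
    """
    return len(judgments) == 10 and all(judgments)


if __name__ == "__main__":
    for prompt in make_trial(10, seed=0):
        print(prompt)
```

A single wrong image out of the 10 means the run fails, so `resolves_yes([True] * 9 + [False])` returns `False`.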