The video must be at least 2 minutes long and must maintain coherence. The visuals, dialogue, and sound must all be of "reasonable" quality: the result does not need to be indistinguishable from human-made video, but there shouldn't be significant artifacts.
@vluzko https://www.youtube.com/watch?v=jz78fSnBG0s In what ways does this not pass the test? Because of the video creator splicing the clips together?
@Nikola The splicing hurts it, but the main thing is that this is a question about being able to generate many kinds of video, not just any one video. Think DALL-E 2 but for video with sound (although I do not require the inputs to be purely text).
@vluzko Two minutes is a really long time, and we don't even have a good DALL-E 1 equivalent for long video, let alone video + sound