50. Will someone release "DALL-E, but for videos" in 2023?
80% chance

This question resolves positively if some person or group credibly announces the existence of an AI that can generate appropriate videos at least 10 seconds long from arbitrary prompts (similar to the way DALL-E generates images from arbitrary prompts) and provides examples.

This is question #50 in the Astral Codex Ten 2023 Prediction Contest. The contest rules and full list of questions are available here. Market will resolve according to Scott Alexander’s judgment, as given through future posts on Astral Codex Ten.

Michael Robertson bought Ṁ34 of YES

I think this would count:

1. Text to video
2. Examples are 5-10 seconds long
3. Can take any text as input

https://makeavideo.studio/

Kronopath bought Ṁ10 of YES

Yeah, this should have been resolved “Yes” before the prediction contest even started. It very clearly meets the resolution criteria as stated: these videos are 10 seconds long and generated from arbitrary prompts.

The only remaining uncertainty is whether Scott’s going to say “lol this one isn’t good enough” for some incoherent reason.

21eleven is predicting NO at 86%

@MichaelRobertson that's not a production service tho. It might be in the future, but I cannot sign up and create a video (currently) the way one can sign up for DALL-E and create images.

21eleven is predicting NO at 86%

@Kronopath you are seeing Meta AI's cherry-picked results on that web page. You cannot use it as a service, which makes it very different from DALL-E.

Kronopath is predicting YES at 86%

@21eleven Doesn’t matter; none of that is in the resolution criteria. All that’s required is that someone “credibly announces the existence of [such] an AI […] and provides examples”. No need for this to become a production service or to provide non-cherry-picked examples.

21eleven bought Ṁ50 of NO

You know this market is high on hopium because it is trading above the image-compositionality market:

https://manifold.markets/ACXBot/41-will-an-image-model-win-scott-al?r=MjFlbGV2ZW4

21eleven is predicting NO at 81%

@21eleven not anymore 🙂

Phil is predicting YES at 85%
BTE is predicting YES at 85%

@PhilWheatley That’s cool. But why does it say Google when they don’t work at Google?

BTE is predicting YES at 85%

@PhilWheatley And for the record I absolutely think this is gonna happen. It just didn’t happen last April. Maybe this April though…

Phil is predicting YES at 85%

@BTE I'm not really sure why Google is mentioned in the title and video, as it's not mentioned in the paper: https://tuneavideo.github.io

Sorry, I don't have an answer for you.

Michael Robertson is predicting YES at 85%

@BTE @PhilWheatley This project was built on top of Google's "Nerfies" project (https://github.com/google/nerfies). They haven't released their code yet, but if it's just a custom dataset they trained against, it would make sense to call it "Google's AI".

Phil is predicting YES at 85%

@MichaelRobertson cheers for the clarity

21eleven is predicting NO at 85%

@PhilWheatley I don't think that counts, as it requires an input video to style-transfer onto, with the style transfer guided by the user's text. A DALL-E for videos would work from a text prompt alone.

Phil is predicting YES at 85%

@21eleven Fair point. The criteria hadn't explicitly excluded a video input, so I thought I'd point it out at least.

Michael Robertson is predicting YES at 80%

This should resolve YES. Here is a paper:
https://arxiv.org/abs/2204.03458
And here is an example:
https://video-diffusion.github.io/

BTE is predicting YES at 80%

@MichaelRobertson I can tell you for certain this is not going to resolve YES based on something published last April. Those videos are more like gifs than videos, as evidenced by the fact that they embedded dozens of them on a single webpage (though the page still crashed almost immediately, so somewhere between gif and video).

Phil is predicting YES at 85%

@BTE What does that even mean? What's the difference between a gif and a video, other than one being able to carry sound?

BTE is predicting YES at 85%

@PhilWheatley Is that a serious question? How many videos do you watch in a constant 3-second loop?

Phil is predicting YES at 85%

@BTE Where in the definition of a video does it say it needs to be a certain length? A video is defined as "a recording of moving visual images", i.e. a series of images depicting motion, which is exactly what a gif is; by that definition they are the same.

BTE is predicting YES at 85%

@PhilWheatley Dude, if it were a video they would call it a video instead of a gif. You aren’t going to convince me. And it’s irrelevant because, again, that paper was published last April, before the contest even started, yet Scott still posed the question. That clearly means @ScottAlexander doesn’t think gifs are videos, and his opinion is the only one that matters.

Phil is predicting YES at 85%

@BTE Are you serious? Why do they call things gifs instead of pictures, then? Or why are some videos mpegs? If you are going to use a file format to define it, then there are no pictures on the internet, only jpegs, gifs, etc., and no videos, only avis, mpgs, etc. And to be clear, I'm not arguing in favour of that paper, just that gif and video are not distinct from each other; both are a slideshow of images.

BTE is predicting YES at 85%

@PhilWheatley Don’t waste your time.

Phil is predicting YES at 85%

@BTE Thanks for the reminder; I forget sometimes that it's fruitless to argue with the ignorant.

Mira bought Ṁ73 of NO

Videos are a natural extension of images, and the constraint of having to be coherent over time means a video isn't that much more information than an image, so YES seems understandable.

I think images have a tighter semantic tie between prompt and output than videos would, though, so it will be much harder to train good-quality video models. Something like Danbooru can have dozens of tags on an image, and for most of them it's clear which specific section of the image a tag like "red ribbon" refers to.

Some videos would be easier: small variations on a single scene (like a butterfly flapping its wings in a loop), or videos of 3D scenes with geometric shapes (like raytracing demos). But if this means the DeepDream-style approach, where you start with an image and repeatedly make it "look more like the prompt", I think that would struggle with videos that aren't just small variations on an image.
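
For concreteness, here is a rough single-image sketch of that kind of prompt-guided optimization loop. This is only an illustration under the assumption of using CLIP as the image-text scorer; the prompt, model choice, and hyperparameters are made up and it is not the method from any of the papers linked above.

```python
# Minimal sketch (illustrative only) of a "DeepDream-style" loop: start from
# pixels and repeatedly nudge them toward a text prompt, scored with OpenAI's
# CLIP model. Assumes PyTorch and the `clip` package from
# https://github.com/openai/CLIP.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so gradients flow to the pixels

# Encode the prompt once; it stays fixed during optimization.
tokens = clip.tokenize(["a butterfly flapping its wings"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimize the image pixels directly (could also start from a real frame).
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    opt.zero_grad()
    img_feat = model.encode_image(image.clamp(0, 1))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize similarity to the prompt
    loss.backward()
    opt.step()
```

Running a loop like this independently per frame gives no temporal consistency, which is why that style of approach would tend to fall apart for anything beyond small variations on one image.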