Will there be realistic AI generated video from natural language descriptions by the start of 2023?
Resolved NO · Jan 3 · 197 traders · Ṁ33k
Resolves YES if there is a model that receives a natural language description (e.g. "Give me a video of a puppy playing with a kitten") and outputs a realistic-looking video matching the description. It does *not* have to be *undetectable* as AI generated, merely "realistic enough". It must be able to consistently generate realistic videos >=30 seconds long to count. DALL-E 2 (https://cdn.openai.com/papers/dall-e-2.pdf) counts as "realistic enough" *image* generation from natural language descriptions (I am writing this before the model is fully available; if it turns out that all the samples are heavily cherry-picked, DALL-E 2 does not count, but a hypothetical model as good as the cherry-picked examples would).

Last call to submit anything, otherwise I am resolving NO tomorrow.

predicted YES

@vluzko Have you seen Google's merge of Phenaki with Imagen Video? https://twitter.com/alonsorobots/status/1587913514210840576

predicted NO

@ms That's pretty good but if you take the definition of "realistic" as:

seeming to exist or be happening in fact

I think we're still not quite there yet.

predicted NO

@ms Not realistic, IMO. There are really bad artifacts in the astronaut: it gets blockish and skewed, changing profile in a way that is very distinct from perspective and limb motion. The backpack gets melded into the arm, and many frames are marred with polygon artifacts, like a horrible compression algorithm. For example, the one below.

@ms This is very impressive but certainly not realistic. Also, if people could link me to papers or publications instead of Twitter, I would appreciate it; I have Twitter blocked. Not that I expect anyone to remember this request, but maybe if I'm consistently annoying about it people will start to.

predicted NO

@SamuelRichardson There's something really philosophically elegant about linking the component definitions, there.

predicted NO

@L I'll take that, even though I totally did it by accident when copying and pasting from the site I got it from lol

predicted NO

@VictorLevoso It's <30 seconds and imho not "realistic enough".

predicted YES

@ValeryCherepanov I was mostly posting because it's interesting information that might be useful for people betting on this market, not proposing that it resolves YES based on it.

predicted NO

@VictorLevoso I flipped my bets because it looks like the parts of rendering won't be fully integrated in time. If we had natural 3D rendering to disentangle the 3D representation, then maybe, but currently we're waiting for natural 3D to match Stable Diffusion.

Just 6 weeks left

predicted NO

Longer videos, but neither from natural language nor really "realistic": https://ai.googleblog.com/2022/11/infinite-nature-generating-3d.html

Don't think it meets the 30-second requirement, but https://imagen.research.google/video/

predicted NO

@o I think it fails the 30 second requirement?

predicted NO

@Yev And it doesn't receive a natural language description.

So it won't resolve this market. Still cool though!

@Yev These seem like problems that could be partially solved by pipelining a few stages together. It seems within reach for "realistic enough" videos to be produced by decomposing a natural language description of a video into keyframe descriptions, from which individual images could be generated and then interpolated between.

This would necessitate extremely consistent images though, and I'm unsure as to how achievable that is with current image models.
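A minimal sketch of the keyframe-decomposition pipeline described above, assuming three hypothetical helpers (describe_keyframes, text_to_image, interpolate_frames) that stand in for a language model, a text-to-image model, and a frame-interpolation model; none of these are real APIs, and the previous-frame conditioning is exactly where the consistency problem would bite:

```python
from typing import List, Optional

def describe_keyframes(prompt: str, n_keyframes: int) -> List[str]:
    """Hypothetical: use a language model to split a video description
    into a sequence of per-keyframe scene descriptions."""
    raise NotImplementedError

def text_to_image(description: str, previous_frame=None):
    """Hypothetical: a text-to-image model, optionally conditioned on the
    previous keyframe to keep subjects consistent across keyframes."""
    raise NotImplementedError

def interpolate_frames(frame_a, frame_b, n_intermediate: int) -> list:
    """Hypothetical: a frame-interpolation model that fills in motion
    between two keyframes."""
    raise NotImplementedError

def generate_video(prompt: str, n_keyframes: int = 30, fill: int = 23) -> list:
    """Keyframe pipeline: prompt -> keyframe descriptions -> images -> interpolation."""
    frames: list = []
    previous: Optional[object] = None
    for description in describe_keyframes(prompt, n_keyframes):
        keyframe = text_to_image(description, previous_frame=previous)
        if previous is not None:
            # Fill the gap between consecutive keyframes with interpolated frames.
            frames.extend(interpolate_frames(previous, keyframe, fill))
        frames.append(keyframe)
        previous = keyframe
    return frames
```

Even with the previous-frame conditioning, keeping subjects and scenes consistent across dozens of independently generated keyframes is the hard part, which is the concern raised above.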

predicted NO

Why did the wiggles stop? :(

@Yev Cowardice, one assumes.

There is a rumour that Stability AI already has a model similar to Make-A-Video but better. I am pretty sure it will be released this year, but it's probably not going to be realistic enough (and may struggle with 30+ seconds as well).

Love seeing how wild the swings are on this market. Curious if the traders are bimodally distributed and if so what the two groups are.

This is probably really computationally expensive, and the money available to throw at funny image generation models probably isn't going to reach anything like parity with DALL-E 2 without too fat a stack of cash to justify a video model of that size. Image/text/video generation all runs on heroic amounts of sunk manual toil, and even DALL-E 2 is not that far out of "routinely generates horrifying fleshbeasts" territory. Consistently getting video that isn't full of multiple sequential slides of horrifying fleshbeasts sounds far more than a couple of months and a few kajillowatts of dissipated heat away to me.

predicted NO

@nfd On the subject of horrifying fleshbeasts, do you think the current models could handle generating purely close-up pornographic shots? It seems like a pretty narrow slice of the problem space that would nonetheless probably have enough economic utility to pay for the investigation. They have a lot of trouble with hands and faces, but perhaps those aren't necessary to the experience . . .

predicted NO

@nfd I imagine pornographic images are probably strongly underrepresented in most good natural-language-prompt models' training data. Some specialized classifiers and editing models like Yahoo's open-nsfw, waifu2x (raster upscaling trained on anime art), and DeepCreamPy (decensoring) are notable exceptions for specific subdomains. Running open-nsfw backwards tends to generate incomprehensible fleshbeasts. Unless you're generating hentai (since e.g. danbooru is so well-tagged), I figure there's a long way to go. (And you'd better not mind the hentai having a few extra eyes or arms somewhere unexpected, I guess.)
