Will there be realistic AI generated video from natural language descriptions by the start of 2023?
13%
chance
Resolves YES if there is a model that receives a natural language description (e.g. "Give me a video of a puppy playing with a kitten") and outputs a realistic-looking video matching the description. It does *not* have to be *undetectable* as AI generated, merely "realistic enough". It must be able to consistently generate realistic videos >=30 seconds long to count. DALL-E 2 (https://cdn.openai.com/papers/dall-e-2.pdf) counts as "realistic enough" *image* generation from natural language descriptions. (I am writing this before the model is fully available; if it turns out that all the samples are heavily cherry-picked, DALL-E 2 does not count, but a hypothetical model as good as the cherry-picked examples would.)
Victor Levoso
bought Ṁ55 of YES
Valery Cherepanov
is predicting NO at 22%

@VictorLevoso It's <30 seconds and imho not "realistic enough".

Victor Levoso
is predicting YES at 17%

@ValeryCherepanov I was mostly posting because it's interesting and might be useful information for people betting on this market, rather than proposing the market resolves YES based on it.

L
is predicting NO at 14%

@VictorLevoso I flipped my bets because it looks like the parts of the rendering pipeline won't be fully integrated in time. If we had native 3D rendering to disentangle the 3D representation, then maybe, but currently we're waiting for natural 3D generation to catch up with Stable Diffusion.

Valery Cherepanov
bought Ṁ34 of NO

Just 6 weeks left

Multicore
is predicting NO at 35%

Longer videos, but neither from natural language nor really "realistic": https://ai.googleblog.com/2022/11/infinite-nature-generating-3d.html

Matthew Bream
bought Ṁ10 of NO

Don't think it meets the 30-second requirement, but: https://imagen.research.google/video/

Yev ✔️
is predicting NO at 44%

@o I think it fails the 30-second requirement?

Yev ✔️
is predicting NO at 44%

@Yev And it doesn't receive a natural language description.

So it won't resolve this market. Still cool though!

Collin Gray
sold Ṁ290 of NO

@Yev These seem like problems that could be partially solved by pipelining a few stages together. It seems within reach for "realistic enough" videos to be produced by decomposing a natural language description of a video into keyframe descriptions, from which individual images could be generated and then interpolated between.

This would necessitate extremely consistent images though, and I'm unsure as to how achievable that is with current image models.
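
A minimal sketch of that kind of pipeline, with every model call stubbed out as a toy placeholder rather than a real API (describe_keyframes, generate_image, and interpolate_frames are hypothetical names):

```python
# Rough sketch of the keyframe-decompose-then-interpolate idea. Every model call
# here is a toy placeholder: describe_keyframes stands in for a language model,
# generate_image for a text-to-image model, interpolate_frames for a learned
# frame interpolator. The hard part (consistency) lives inside those stubs.
import numpy as np

def describe_keyframes(prompt, n_keyframes):
    # Placeholder: a real system would ask a language model for per-keyframe prompts.
    return [f"{prompt} (keyframe {i + 1} of {n_keyframes})" for i in range(n_keyframes)]

def generate_image(prompt, size=(64, 64, 3)):
    # Placeholder: stands in for a text-to-image diffusion model.
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return rng.random(size)

def interpolate_frames(frame_a, frame_b, n_intermediate):
    # Placeholder: a linear cross-fade; a real system would need an interpolator
    # that keeps objects and identities consistent between keyframes.
    return [(1 - t) * frame_a + t * frame_b
            for t in np.linspace(0, 1, n_intermediate + 2)[1:-1]]

def text_to_video(prompt, n_keyframes=8, fps=24, seconds=30):
    keyframe_prompts = describe_keyframes(prompt, n_keyframes)
    keyframes = [generate_image(p) for p in keyframe_prompts]
    frames_per_gap = (fps * seconds) // max(n_keyframes - 1, 1)
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        frames.append(a)
        frames.extend(interpolate_frames(a, b, frames_per_gap - 1))
    frames.append(keyframes[-1])
    return frames  # roughly fps * seconds frames for a >=30-second clip

video = text_to_video("a puppy playing with a kitten")
print(len(video), "frames")
```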

Yev ✔️
is predicting NO at 37%

Why did the wiggles stop? :(

vluzko

@Yev Cowardice, one assumes.

Valery Cherepanov
bought Ṁ10 of NO

There is a rumour that Stability AI already has a model similar to Make-A-Video but better. I am pretty sure it will be released this year, but probably it won't be realistic enough (and it may struggle with 30+ seconds as well).

vluzko

Love seeing how wild the swings are on this market. Curious if the traders are bimodally distributed and if so what the two groups are.

nfd
bought Ṁ5 of NO

This is probably actually really computationally expensive, and the money available to throw at funny image-generation models probably isn't going to reach any sort of parity with DALL-E 2 without too fat a stack of cash to justify a video model of that size. Image/text/video generation all run on heroic amounts of sunk manual toil, and even DALL-E 2 is not that far out of "routinely generates horrifying fleshbeasts" territory. Consistently getting video that isn't going to be full of multiple sequential slides of horrifying fleshbeasts sounds far more than a couple of months and a few kajillowatts of dissipated heat away to me.

Andrew Hartman
is predicting NO at 30%

@nfd On the subject of horrifying fleshbeasts, do you think the current models could handle generating purely close-up pornographic shots? It seems like a pretty narrow slice of the problem space that would nonetheless probably have enough economic utility to pay for the investigation. They have a lot of trouble with hands and faces, but perhaps those aren't necessary to the experience . . .

nfd
is predicting NO at 40%

@nfd I imagine pornographic images are probably strongly underrepresented in most good natural-language-prompt models' training data. Some specialized classifiers and editing models like Yahoo's open-nsfw, waifu2x (raster upscaling trained on anime art), and DeepCreamPy (decensoring) are notable specialized exceptions for specific subdomains. Running open-nsfw backwards tends to generate incomprehensible fleshbeasts. Unless you're generating hentai (since e.g. danbooru is so well-tagged), I figure there's a long way to go. (And you better not mind that hentai having a few extra eyes or arms somewhere unexpected, I guess.)

Lorenzo
is predicting NO at 66%

Does this count as "realistic enough"? https://imagen.research.google/video/ I would say no, but it's hard to tell.

vluzko

@Lorenzo Not realistic enough. This looks to me a lot like the image models we had immediately pre-DALL-E 2, in that they are mostly there but everything looks like part of an acid trip.

Gigacasting
is predicting NO at 66%

Uncanny valley—Phenaki works because it’s low-res and playful; Meta works for “live” GIFs and isn’t even a video model.

Imagen manages to be unwatchable by trying and failing at both quality and coherency.

Pick one! (For another 6-18 months)

Gigacasting

Phenaki is closed-source and nowhere near realistic; ~30% chance someone copies it and merges the Meta layers or diffusion as post-processing.

Very doable, just not clear who’s going to do it this soon (as they’d basically be ripping off the core architecture, >5 months before the ICLR conference where it will be discussed)

Stephen Malina
bought Ṁ10 of NO

Seeing Phenaki, I think there's a decent chance I'll lose on this one, but betting on it is useful as a way for me to track whether progress is (again) faster than I expect.

vluzko

As of 2022-09-30, Make-A-Video does not resolve this question: none of the videos released are long enough. The quality is borderline as well, and I would need to see more non-cherry-picked examples to decide.

There's no technical limitation preventing Make-A-Video from producing a 30 second output, so it's possible it will resolve this question yes once someone actually tries and publishes the result.

AFAICT the paper doesn't directly state the scale of the models, although it suggests it's in the 1-10B parameter range. I would expect just scaling the model up to remove most per-frame quality issues and plausibly (~60%) hit the 30-second mark. AFAIK no one has measured scaling laws for T2V, which is where most of my uncertainty is coming from.
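
For illustration only, the kind of measurement this refers to would look roughly like fitting a power law to per-frame quality at a few model sizes and extrapolating to ~10B parameters. Every number below is invented, since no T2V scaling measurements have been published:

```python
# Illustrative only: fit quality(N) ~= a * N^(-alpha) to hypothetical per-frame
# quality scores (lower is better, think of an FVD-style metric) and extrapolate.
# The data points are made up; they exist only to show the shape of the fit.
import numpy as np

model_params = np.array([0.3e9, 1e9, 3e9])   # model sizes N (hypothetical)
quality = np.array([550.0, 410.0, 320.0])    # pretend quality metric (hypothetical)

# Linear regression in log-log space gives the power-law exponent and prefactor.
slope, intercept = np.polyfit(np.log(model_params), np.log(quality), 1)
alpha, a = -slope, np.exp(intercept)

for n in (1e9, 1e10):
    print(f"N = {n / 1e9:>4.0f}B params -> predicted metric ~{a * n ** (-alpha):.0f}")
```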

Victor Levoso
bought Ṁ100 of YES
vluzko

@VictorLevoso Hmm. The quality is borderline as well, and the 2-minute video may be cherry-picked. I think this could resolve it, and a scaled-up version almost certainly would, but I will not actually resolve it until I see more long videos. (I am also going to have to go back and review the original DALL-E 2 images so my standards don't get warped by current image generation capabilities.)

Victor Levoso
is predicting YES at 54%

I don't think the Meta thing counts, and I dunno what you will count as good enough, but I expect to see multiple relevant papers soon.

More likely we'll have it by next year and the market will resolve NO, but it's plausible by December.

On the other hand, it seems likely that people will be working on, and maybe already have ready, a good enough model, but it won't come out until later.

Publications can lag months behind actual progress, though on the other hand conferences and a desire to one-up Meta can prompt people to show things early.

So I wouldn't be that surprised if we see a big model from someone else that looks like more than 3 months of progress on from the Meta paper, precisely because both efforts actually started months ago.

Also, I've read somewhere that people will be getting H100s soon, so maybe that will change things.

Orpheus
bought Ṁ50 of YES

Now the open question for this market is the video length.

wasabipesto
bought Ṁ50 of YES
Gigacasting
bought Ṁ50 of NO

Bad GIFs to realistic video is not a 3-month process.

Meta isn’t even replying to their own tweet with examples 🤔

Mikhail Samin
is predicting YES at 13%

https://makeavideo.studio/ I think 13% is too low

Samuel Richardson
is predicting NO at 12%

Looks like we're getting close to this: https://text2live.github.io/ I can't see any evidence that it outputs a >=30-second video, though.
Samuel Richardson
bought Ṁ100 of NO

I haven't seen any mention of stable generation of AI imagery over time yet. DALL-E 2 works great for single images, but making the image stable over time so you can animate it? A whole other story!