Duplicated for 2024 here: https://manifold.markets/vluzko/will-there-be-realistic-ai-generate-6aeae916397c
Longer videos, but neither from natural language nor really "realistic": https://ai.googleblog.com/2022/11/infinite-nature-generating-3d.html
@Yev These seem like problems that could be partially solved by pipelining a few stages together. It seems within reach for "realistic enough" videos to be produced by decomposing a natural language description of a video into keyframe descriptions, from which individual images could be generated and then interpolated between.
This would necessitate extremely consistent images though, and I'm unsure as to how achievable that is with current image models.
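For concreteness, here is a rough sketch of the pipeline described above: split the description into keyframe prompts, generate one image per prompt, then interpolate between neighbouring images. Everything in it is a hypothetical stand-in. `split_into_keyframes` just splits on sentences, `toy_image_model` returns a fake "image" (a short list of floats) instead of calling a real text-to-image model, and `crossfade` is a plain linear blend where a real system would presumably need a learned frame interpolator. It only shows the shape of the idea, and it does nothing about the consistency problem.

```python
from typing import Callable, List

Frame = List[float]  # stand-in for an image: a flat list of pixel intensities


def split_into_keyframes(description: str) -> List[str]:
    """Hypothetical stage 1: treat each sentence as one keyframe prompt."""
    return [s.strip() for s in description.split(".") if s.strip()]


def toy_image_model(prompt: str, size: int = 4) -> Frame:
    """Hypothetical stage 2 stand-in: a deterministic fake image per prompt.

    A real pipeline would call a text-to-image model here.
    """
    seed = sum(ord(c) for c in prompt)
    return [((seed * (i + 1)) % 256) / 255.0 for i in range(size)]


def crossfade(a: Frame, b: Frame, steps: int) -> List[Frame]:
    """Stage 3 stand-in: linear blend giving `steps` frames strictly between a and b."""
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)
        frames.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    return frames


def text_to_video(
    description: str,
    image_model: Callable[[str], Frame] = toy_image_model,
    steps: int = 3,
) -> List[Frame]:
    """Chain the three stages: prompts -> keyframes -> interpolated frame list."""
    keyframes = [image_model(p) for p in split_into_keyframes(description)]
    if not keyframes:
        return []
    video = [keyframes[0]]
    for prev, nxt in zip(keyframes, keyframes[1:]):
        video.extend(crossfade(prev, nxt, steps))
        video.append(nxt)
    return video


video = text_to_video("A dog runs. It jumps. It lands.", steps=3)
print(len(video))  # 3 keyframes plus 3 in-between frames for each of the 2 gaps
```

Because each keyframe is generated independently from its own prompt, nothing ties the subject or background together from one keyframe to the next, which is exactly where the consistency worry bites.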
Why did the wiggles stop? :(
There is a rumour that Stability AI already has a model similar to Make-A-Video but better. I am pretty sure it will be released this year, but it's probably not going to be realistic enough (and may struggle with 30+ seconds as well).
Love seeing how wild the swings are on this market. Curious if the traders are bimodally distributed and if so what the two groups are.
This is probably really computationally expensive, and the money people are willing to huck at funny image generation models probably isn't going to get video to any sort of parity with DALL-E 2 without too fat a stack of cash to justify a video model of that size. Image/text/video generation all runs on heroic amounts of sunk manual toil, and even DALL-E 2 is not that far out of "routinely generates horrifying fleshbeasts" territory. Consistently getting video that isn't full of multiple sequential slides of horrifying fleshbeasts sounds like far more than a couple of months and a few kajillowatts of dissipated heat away to me.
@nfd On the subject of horrifying fleshbeasts, do you think the current models could handle generating purely close-up pornographic shots? It seems like a pretty narrow slice of the problem space that would nonetheless probably have enough economic utility to pay for the investigation. They have a lot of trouble with hands and faces, but perhaps those aren't necessary to the experience . . .
@nfd I imagine pornographic images are probably strongly underrepresented in most good natural-language-prompt models' training data. Some specialized classifiers and editing models like Yahoo's open-nsfw, waifu2x (raster upscaling trained on anime art), and DeepCreamPy (decensoring) are notable specialized exceptions for specific subdomains. Running open-nsfw backwards tends to generate incomprehensible fleshbeasts. Unless you're generating hentai (since e.g. danbooru is so well-tagged), I figure there's a long way to go. (And you better not mind that hentai having a few extra eyes or arms somewhere unexpected, I guess.)
Uncanny valley—Phenaki works because it’s low-res and playful; Meta works for “live” GIFs and isn’t even a video model.
Imagen manages to be unwatchable by trying and failing at both quality and coherency.
Pick one! (For another 6-18 months)
Phenaki is closed-source and nowhere near realistic — ~30% someone copies it and merges the meta layers or diffusion as post-processing
Very doable, just not clear who’s going to do it this soon (as they’d basically be ripping off the core architecture, >5 months before the ICLR conference where it will be discussed)
Seeing Phenaki, I think there's a decent chance I'll lose on this one, but betting on it is useful as a way for me to track whether progress is (again) faster than I expect.
As of 2022-09-30 Make-A-Video does not resolve this question: none of the videos released are long enough. The quality is borderline as well, and I would need to see more non-cherry-picked examples to decide.
There's no technical limitation preventing Make-A-Video from producing a 30 second output, so it's possible it will resolve this question yes once someone actually tries and publishes the result.
AFAICT the paper doesn't directly state the scale of the models, although it suggests it's in the 1-10B parameter range. I would expect just scaling the model to remove most per-frame quality issues, and plausibly (~60%) hit the 30-second mark. AFAIK no one has measured scaling laws for T2V, which is where most of my uncertainty is coming from.
@VictorLevoso Hmm. The quality is borderline as well, and the 2-minute video may be cherry-picked. I think this could resolve it, and a scaled-up version almost certainly would, but I will not actually resolve until I see more long videos. (I am also going to have to go back and review the original DALL-E 2 images so my standards don't get warped by current image generation capabilities.)
I don't think the Meta thing counts, and I don't know what you will count as good enough, but I expect to see multiple relevant papers soon.
More likely we'll have it by next year and the market will resolve no, but it's plausible by December.
On the other hand it seems likely that people will be working on and maybe have ready a good enough model but it won't come out until later.
Publications can lag months behind actual progress, though on the other hand conferences and a desire to one-up Meta can prompt people to show things early.
So I wouldn't be that surprised if we see a big model by someone else that looks like it's more than 3 months of progress from the Meta paper, precisely because both things actually started months ago.
Also I've read somewhere that people will be getting H100s soon, so maybe that will change things.
Bad GIFs to realistic video is not a 3-month process.
Meta isn’t even replying to their own tweet with examples 🤔