@vluzko Have you seen Google's merge of Phenaki with Imagen Video? https://twitter.com/alonsorobots/status/1587913514210840576
@ms Not realistic, IMO. There are really bad artifacts in the astronaut: it gets blocky and skewed, changing profile in a way that clearly isn't just perspective or limb motion. The backpack gets melded into the arm, and many frames are marred with polygon artifacts like a horrible compression algorithm. For example, the one below.
@ms This is very impressive but certainly not realistic. Also, if people could link me to papers or publications instead of twitter I would appreciate that; I have twitter blocked. Not that I expect anyone to remember this request, but maybe if I'm consistently annoying about it people will start to.
@SamuelRichardson There's something really philosophically elegant about linking the component definitions, there.
@L I'll take that even though I totally did it by accident when copy and pasting from the site I got it from lol
@ValeryCherepanov Was mostly posting because it's interesting and might be useful information for people betting on this market, rather than proposing it resolves YES based on it.
@VictorLevoso I flipped my bets because it looks like the rendering components won't be fully integrated in time. If we had natural 3D rendering to disentangle the 3D representation, then maybe, but currently we're waiting for natural 3D generation to match Stable Diffusion.
Duplicated for 2024 here: https://manifold.markets/vluzko/will-there-be-realistic-ai-generate-6aeae916397c
Longer videos, but neither from natural language nor really "realistic": https://ai.googleblog.com/2022/11/infinite-nature-generating-3d.html
@Yev And it doesn't receive a natural language description.
So it won't resolve this market. Still cool though!
@Yev These seem like problems that could be partially solved by pipelining a few stages together. It seems within reach for "realistic enough" videos to be produced by decomposing a natural language description of a video into keyframe descriptions, from which individual images could be generated and then interpolated between.
This would necessitate extremely consistent images, though, and I'm unsure how achievable that is with current image models.
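A minimal sketch of the pipeline I mean, where every function is a hypothetical placeholder (keyframe decomposition, text-to-image, frame interpolation) standing in for whatever models you'd actually use, not any real API:

```python
from typing import List

# Hypothetical stages of the pipeline described above; none of these
# correspond to an existing library, they just mark where real models
# would slot in.

def decompose_to_keyframes(description: str, n_keyframes: int) -> List[str]:
    """Use a language model to turn one video description into
    n_keyframes per-keyframe image descriptions (placeholder)."""
    raise NotImplementedError

def generate_image(prompt: str):
    """Text-to-image model, e.g. a diffusion model (placeholder)."""
    raise NotImplementedError

def interpolate(frame_a, frame_b, n_intermediate: int):
    """Frame-interpolation model filling in n_intermediate frames
    between two keyframes (placeholder)."""
    raise NotImplementedError

def text_to_video(description: str, n_keyframes: int = 8, gap: int = 23):
    prompts = decompose_to_keyframes(description, n_keyframes)
    keyframes = [generate_image(p) for p in prompts]
    video = []
    for a, b in zip(keyframes, keyframes[1:]):
        video.append(a)
        video.extend(interpolate(a, b, gap))
    video.append(keyframes[-1])
    return video
```

The hard part, as noted, is making the keyframes consistent with each other; nothing in this sketch enforces that.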
This is probably actually really computationally expensive, and the money to throw at funny image generation models probably isn't going to reach any sort of parity with DALL-E 2 without too fat a stack of cash to justify that size of video model. Image/text/video generation all runs on heroic amounts of sunk manual toil, and even DALL-E 2 is not that far out of "routinely generates horrifying fleshbeasts" territory. Consistently getting video that isn't going to be full of multiple sequential slides of horrifying fleshbeasts sounds far more than a couple months and a few kajillowatts of dissipated heat away to me.
@nfd On the subject of horrifying fleshbeasts, do you think the current models could handle generating purely close-up pornographic shots? It seems like a pretty narrow slice of the problem space that would nonetheless probably have enough economic utility to pay for the investigation. They have a lot of trouble with hands and faces, but perhaps those aren't necessary to the experience . . .
@nfd I imagine pornographic images are probably strongly underrepresented in most good natural-language-prompt models' training data. Some specialized classifiers and editing models, like Yahoo's open-nsfw, waifu2x (raster upscaling trained on anime art), and DeepCreamPy (decensoring), are notable exceptions for specific subdomains. Running open-nsfw backwards tends to generate incomprehensible fleshbeasts. Unless you're generating hentai (since e.g. danbooru is so well-tagged), I figure there's a long way to go. (And you'd better not mind the hentai having a few extra eyes or arms somewhere unexpected, I guess.)
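For anyone unfamiliar, "running a classifier backwards" here means roughly activation maximization: gradient ascent on the input pixels to maximize the classifier's NSFW score. A minimal sketch, assuming a generic PyTorch image classifier that returns a scalar logit rather than open-nsfw's actual (Caffe-era) interface:

```python
import torch

def run_classifier_backwards(classifier, steps: int = 200, lr: float = 0.05):
    """Gradient ascent on input pixels to maximize the classifier's score.
    `classifier` is assumed to map a (1, 3, H, W) image tensor to a scalar
    logit; this is an illustrative sketch, not open-nsfw's real API."""
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = classifier(img.clamp(0, 1))
        (-score).backward()  # ascend on the classifier score
        opt.step()
    return img.detach().clamp(0, 1)
```

Nothing constrains the result to look like a plausible image, which is why the output tends toward incomprehensible fleshbeasts rather than anything useful.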