Based on this tweet: https://twitter.com/ArthurB/status/1528991584309624832. The question resolves positive if a model is capable of generating arbitrary videos of reasonable quality from text prompts and demonstrates "object permanency", in the sense that it resolves full object occlusions correctly most of the time (for example, a mountain temporarily hidden by a low cloud should still be there after the cloud moves). If it's unclear whether some existing model has these capabilities by the deadline, I'll use my judgment to decide how to resolve the market, and I will lean towards yes in cases where the model mostly does it correctly for simple videos but fails on cherry-picked edge cases.
Edit:
For extra clarification, "published" means that at least a paper or some official announcement has to come out about it before that date.
Also, the fact that I haven't resolved yes doesn't necessarily mean I think none of the models out so far count. I'm likely going to wait until June in any case and resolve based on the best model available by then, unless something pretty clearly counts before that (in case someone was updating on how harsh my judgment is based on seeing that I haven't resolved despite whatever model coming out).
Emad recently said (in a Raoul Pal video) that Stability AI's text-to-video model will likely be released in May.

A machine that creates art,
With a single click to start,
But can it make videos that sing?
Only time will tell, ding-a-ling.


What does published mean? Must it just be demonstrated that such a model existed (even with limited access) before June 2023?

@Sky
If there's a closed beta and I get enough info from people trying the model to say it gets object permanency mostly right, this resolves yes.
If, for example, OpenAI had such a model (which might already be the case) but they don't release it in any way until after that date, then it doesn't count and resolves yes.
If there's a paper but people can't try the model, then we are in unclear territory and

@VictorLevoso Oops, clicked send before finishing the message:
If there's a paper/blog/announcement before that date, and lots of disagreement about whether the tests on it show it counts or are just cherry-picked, and the market doesn't have any strong opinion either way, I'll either resolve N/A or use my own judgment to decide if it's clear that it does count.

@VictorLevoso Also, I obviously meant "resolves no" after
"If for example OpenAI had such a model."
Ugh, kind of annoying not being able to edit.

@Sky I think the snippets in the video might qualify as non-crappy but they are probably cherry-picked and average performance will be significantly worse.



@Oldmeme not explicitly, but if it's only a few frames then it's likely to be subjectively "crappy" and it will also be quite difficult to convincingly demonstrate object permanence.

Some more examples. The clips seem long enough https://twitter.com/yining_shi/status/1637840817963278337?s=46&t=15dYmsX5AFHbZyJ-0PnIKA

Hugging Face (a minimal usage sketch follows the example clips below):
https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main
## Example clips:
Star Wars clip using a text-to-video model:
https://twitter.com/victormustar/status/1637461621541949441
Is this a llama clip xD? https://twitter.com/justincmeans/status/1637517337426550785
Shark skiing across the desert! https://twitter.com/hanyingcl/status/1637424841950392321
Astronaut riding a horse, surfing Spider-Man: https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis
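For anyone who wants to probe the occlusion behaviour themselves, here is a minimal sketch of running the model through diffusers. It assumes the diffusers-packaged sibling of the repo linked above (damo-vilab/text-to-video-ms-1.7b) plus settings picked purely for illustration, so treat it as a starting point rather than the official recipe.

```python
# Minimal sketch: generate a short clip with the ModelScope text-to-video model
# via diffusers, then eyeball whether occluded objects persist across frames.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Model id assumed to be the diffusers port of the repo linked above.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # keeps memory use low enough for a single consumer GPU

prompt = "a low cloud drifts across a mountain, then clears"
frames = pipe(prompt, num_inference_steps=25, num_frames=16).frames
print(export_to_video(frames))  # prints the path of the rendered .mp4
```

The prompt deliberately describes the occlusion but not what the mountain looks like, so any detail that survives the cloud has to come from the model rather than from re-reading the text.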
If it didn't include the object permanence bit I'd say maybe, but the problem as described here would require a ridiculous amount of computing power to run (let alone train).

@cherrvak Certainly crappy 😃 But there are enough people working on it that one of them could make significant progress before June 😌
This seems like a strong contender: https://arxiv.org/abs/2302.07685
@NiclasKupper Look at roughly where some of the image generation work was two or three years before it really took off.
Somewhat related market with slightly longer timeframe: https://manifold.markets/SamuelRichardson/will-you-be-able-to-use-ai-tools-to

@NiclasKupper Some example results here: https://sihyun.me/PVDM. (Though note that it does not produce these videos from text prompts)
Does this count? https://twitter.com/runwayml/status/1622594989384519682?s=46&t=xG3-unwphscZW3CuBHcpMQ
If not, what's missing from this example that would be needed?
@RealityQuotient It's really good, but it's not arbitrary, it's video-to-video. Right?



Closed out my position, primarily out of fear of the precise resolution criteria, which haven't been clarified. (See my comments below.)


@LarsDoucet Kind of; that piece of metal the robot is playing with doesn't really seem to have a consistent size.
@jonsimon How bout this?
https://twitter.com/bleedingedgeai/status/1621594463218040832?s=46&t=MfIpN2eAwonhMi-XlGge2g
@LarsDoucet Looks like it needs an input image or video, not just a text prompt. Also, it doesn't look like it handles object occlusions well
@jonsimon It isn't great yet, but they do seem to resolve the 'switching leg' problem by and large by enforcing a sort of spatial coherence.

"For one thing, in addition to ChatGPT and the outfit’s popular digital art generator, DALL-E, Altman confirmed that a video model is also coming, though he said that he “wouldn’t want to make a competent prediction about when,” adding that “it could be pretty soon; it’s a legitimate research project. It could take a while.”" https://techcrunch.com/2023/01/17/that-microsoft-deal-isnt-exclusive-video-is-coming-and-more-from-openai-ceo-sam-altman/


@NikitaBrancatisano Seems like the videos are generated by creating continuations of short videos, rather than being generated by text prompt
@TomShlomi You're right! I missed the text prompt bit 😔
Still, looks very promising

This is looking a little more impressive: https://yingqinghe.github.io/LVDM/
Emad just said that his timeline is 2-3 years. https://www.reddit.com/r/MachineLearning/comments/yw6s1i/comment/iwio085/?utm_source=share&utm_medium=web2x&context=3
@ValeryCherepanov Weird, I'd have thought they'd prioritize this. I'm unclear how much I should trust this estimate, and how much to update if they are working on it right now.
Making of Spiderverse fan video.
Lots of manual editing, and not quite "text-to-video".
But quite promising, especially with regard to frame-to-frame visual consistency.
https://www.youtube.com/watch?v=QBWVHCYZ_Zs&t=1s
@JaimeSevilla I might have gotten too excited.
The frame-to-frame consistency in the actual video is done through optical flow reconstruction, not Stable Diffusion as in the first clip shown.
Still crappy but the dog eating the ice cream weirded me out for sure https://www.creativebloq.com/news/meta-ai-video-generator
The recent progress is definitely more than I expected so soon, but the question asked for object permanency, and we haven't seen anything that looks like that, or even clips long enough to demonstrate it.

Phenaki shows very strong grasp of object perspective, rotation, zoom, layering, and permanency.
One can debate whether it is "crappy" (vastly less so than DALL-E 1/mini) or "reasonable quality" (maybe), but it conditions on all prior frames and clearly passed the permanence tests.
@Gigacasting Hard disagree. It's pretty clear from its footage that it's just relying on the prompt to infer what should be there (see how the teddy bear never leaves the water, but the scene just transitions in a blurred fashion).

@JoyVoid This is why I was pushing for a pre-committed, specific test for object permanency. We're about to get bogged down into "what counts".
From an outside view, every debate about AI progress has the appearance of massive goalpost-moving (on both sides). Very frustrating.
I think the example from the market description (mountain being hidden by a cloud, then appearing after) should be agreeable? We have most of a year for someone to publish a model and someone else to try that prompt.
For examples of missing object permanency: the shape of the treeline noticeably changes when the rider crosses it in the top-center and lower-right images here.

A full occlusion of an object would be ideal. With a partial occlusion, it's hard to tell whether the model actually remembered the object (has object permanency) or merely inferred a good enough filling from the parts of the object it can still see (i.e., is really good at inpainting).

@FutureOwl I can't tell if I'm being pedantic and repetitive, or genuinely poking at an important issue. I apologize if it's the former.
As written in the market description, I think there's an important ambiguity. If the prompt is "a cloud passes in front of a mountain", then the test described, that "the mountain should still be there", is not in any way a test of object permanence. Of course the mountain is still there---the model can just read the prompt to see that there should be a mountain still there!
In short: it's very important that the thing that is permanent be something that isn't in the prompt, and is not inferable from the prompt+single previous frame.

@ScottLawrence As an aside (after watching those videos more closely), I also consider it an object permanence fail if the number of legs the horse has keeps changing!
But that's just beating a dead five-legged horse.

Mostly pedantic 🤔
The description asks for “most of the time”; it obviously passes that test in the videos and especially the longer video.
Picking the worst frame violates everything about the definitions (cherry-picking edge cases).
Not opposed to calling this "crappy", but it misses the point entirely (or reflects a massive difference in what the word "most" means 🤔) to focus on small objects or tree shape when he literally asked for "mountains" not to fully disappear "most of the time".
(No position as it’s very reasonable to go with either result.)
@Gigacasting I actually picked those trees because they were the only ones I could find where the horse went far enough that there was a full occlusion.

Revising: Phenaki with any post-processing achieves this.
While Deepmind has stayed away from the text-image hype race, and OpenAI has been closed off and uncreative, Stable Diffusion clearly sparked something, and people are finally being clever in architectural design and compute efficiency.
Even a moderate combo of Phenaki with the final layers from the Meta paper will produce this.



@Gigacasting Oh yeah I also just saw that paper on twitter, came here to see if someone had mentioned it.
Seems like ICLR is going to have lots of text-to-video stuff, apparently.
Also, after skimming the Meta paper: while I think there's no clear distinction between GIFs and videos, I do agree that only being able to basically specify the first frame of the video, with the rest essentially being interpolation from that, is a big limitation and arguably doesn't fit the spirit of "arbitrary video".
But this anon paper doesn't seem to have that problem, judging from the abstract and examples, so I don't think whether that counts will really matter for the resolution.
(Not that I'm going to resolve yet, but I'm thinking about what my criterion will be.)


These are not only NOT "arbitrary videos", they aren't even videos at all.
They turn a prompt into an image, then do slight animation around it (similar to iPhone "Live" photos, with a few bordering frames).
Very far from resolution: this is high-res GIF animation and bears no resemblance to video.

Here is a pretty convincing demo:
https://mobile.twitter.com/hardmaru/status/1575476224880934913
It's still pretty bad with respect to the question, but this is DALL-E 1 levels of crappy. I would be very surprised if we don't reach Stable Diffusion level by June 2023 (I am less sure about an open-source/toxiccandy model by then).


@JaimeSevilla Good demo, but still in crappy territory. Hoping to see a lot of progress in the near future. If you read the paper, you will see the limitations of this approach, mainly because we have GPUs with low memory. So this is not the path if we are hoping to get text-to-video with current hardware.

@JaimeSevilla Yeah I saw that.
The demo looks nice but I'm not going to resolve yes yet.
It's possible that their model is good enough already, but I'm fine with waiting a few months until we have better models and it's more obvious, especially if Stability releases an open version I can play around with or something.
If we somehow don't get any better video models before the close date, I'll think hard about it and ask people whether they think it should resolve based on this demo, but that seems very unlikely.
@VictorLevoso In my opinion, this should not resolve yes yet, with respect to the "non-crappy" part of your question

@JoyVoid I think I mostly agree, yeah. It does feel more like DALL-E 1, and looking at the paper it seems to have some limitations around how prompts work.
It is interesting that it doesn't obviously fail at object permanency specifically, but we also only have four potentially cherry-picked five-second videos.
Btw, to be clear, I don't require an open-source version to be released to resolve yes; it will just make my job easier and the resolution less contentious.

On the basis of the "object permanency" requirement, I'm betting NO. I do think it would be best if we could agree on a set of test cases, particularly in light of SA's recent claim to have effectively won a bet of this sort.
Can I suggest: a video of a crowd of people, with a car (or train) passing in front.
@ScottLawrence How much is the train or crowd of people allowed to change/shift? IE, if the train changes color or type of train, does it still count? If the number of people changes, or any of their faces/genders/clothing noticeably swaps, does it still count?

@LarsDoucet In what I described the train is in front---visible the entire video. If a visible object changes color or shape, that's a hard fail, right?
Since the point is to test object permanence, if the number of people changes when occluded by the train, that's also a fail. Similarly if it's obvious that a person changed. (For instance, if before the train passes there's a tall guy in a colorful hat, and that guy can't be found afterwards, that would be a fail from my perspective.)
Of course, the resolution is up to the market creator, I'm just proposing a test which I think reasonably probes whether the model "gets" object permanence.
I'm fine, by the way, with needing an extra prompt like "and no people vanish while the train is obstructing the view". However, needing to plug in object permanence manually (by saying "three people are visible before, and three people are visible after") shouldn't count.
Weaker versions are possible. For instance, in a video of "bus passes in front of person", the person should have the same appearance before and after. Now there's much less information for the network to keep track of. Any network that fails that cannot reasonably be said to grasp object permanence, I think.
@ScottLawrence I like this test, just wanted to clarify. Because a lot of the video stuff I've been seeing tends to have a lot of shifting of these cases, but some of the newer models seem to be doing at least a bit better on that front. Nice and clear goal in any case.

@ScottLawrence Sounds like a good test to me.
Although I might still resolve yes if it can't do this but can do other similar things, or no if it can do the bus/train tasks but fails at similar tasks most of the time.
I think the "person still exists" level is enough; the model has to show it can do basic object permanence, not be perfect at it. So let's say it's fine if people and objects shift a bit, as long as they are recognizable as the same kind of person or object.
I'll be lenient with things shifting, but not with things disappearing.
If it can do easy object permanency tasks with one object but fails with multiple objects, I will likely resolve yes anyway, as long as it's consistently good at the easy tasks.

@VictorLevoso Sounds fair.
One of the guiding principles I have is that it's absolutely essential that the model be able to perform object permanence for objects not explicitly mentioned in the prompt. So for instance, if I say "a bus passing in front of a woman wearing a red shirt", then I'm not impressed if before and after, there's a woman wearing a red shirt, since the model doesn't really have to "remember" anything, but can rather just refer back to the prompt.
Whereas, if I say "a bus passing in front of a woman", and the appearance of the woman is nontrivially the same before and after, that actually indicates to me that the model has captured the essence of object permanence. (Even if the shoes look a bit different before and after.)
That's the point of the "crowd of people"---it forces the model to invent a decent amount of information, and then get that information right before and after.
(I'm writing all this up less to convince/browbeat y'all, and more to record for me-in-a-year that this is the thing I doubt will be accomplished. That way I know when to be surprised/impressed.)
(While my NO bet is of course unconditional, I'll be properly surprised if this is done without any "hack", like first asking a language model to generate a detailed description of a scene, and then feeding that description into a video model. In other words, if a model manages to learn object permanence on its own.)
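To make the train/crowd proposal and the leniency rules above concrete, here is one hypothetical way a pre-committed test set could be written down. Purely illustrative: the prompts, the pass criteria, and the majority threshold are assumptions, not anything agreed in this thread.

```python
# Hypothetical pre-committed checklist for the occlusion tests sketched above.
# Prompts avoid naming the attributes that must persist, so the model cannot
# simply re-read them from the text.
OCCLUSION_TESTS = [
    {
        "prompt": "a train passes in front of a small crowd waiting at a station",
        "pass_if": "same number of people, same distinctive outfits, before and after",
    },
    {
        "prompt": "a bus passes in front of a woman standing on a sidewalk",
        "pass_if": "the woman is recognizably the same person before and after",
    },
    {
        "prompt": "a low cloud drifts across a mountain ridge and then clears",
        "pass_if": "the ridge silhouette and treeline match before and after",
    },
]

def resolves_yes(human_judgements: list[bool]) -> bool:
    """Turn per-test human pass/fail calls into a resolution, reading the market's
    'most of the time' as a simple majority (an assumption, not the actual rule)."""
    return sum(human_judgements) > len(human_judgements) / 2
```

Per the rules above, appearance shifts would be tolerated; a person or object that vanishes or duplicates while occluded fails its test outright.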


What if it's gated behind something that blocks open testing?
What if it is limited to 20-second-long videos?

@M For the first one, I'll decide how to resolve based on the situation and how much info is available.
I'm unsure what to do in the case of really short videos.
I actually expect that if we have good video models we'll have videos that are at least a bit longer (I mean, Transframer is already 30 seconds, even if it doesn't seem to be great quality), and that even if it's not open, Stability will release an open-source version soon enough, so it's likely not going to be a problem.
I guess I'll also ask people how they think the market should reasonably be resolved if it's not obvious, and if it's still unclear it will just resolve N/A or 50%.
Would something that was "high-quality" but limited in scope, like a digital-art-only or animation-only version, count?

@AlexWilson Let's say no, I guess. It has to at least be able to do both realistic video and some other format like animation.