Will a non-crappy video equivalent of dall-e be published before June 2023?
Jun 1

Based on this tweet https://twitter.com/ArthurB/status/1528991584309624832. Question resolves positive if a model is capable of generating arbitrary videos of reasonable quality from text prompts and demonstrates "object permanency", in the sense that it can resolve full object occlusions correctly (for example, a mountain temporarily hidden by a low cloud should still be there after the cloud moves) most of the time. If it's unclear whether some existing model has these capabilities by the deadline, I'll use my judgment to decide how to resolve the market, and will lean towards yes in cases where the model mostly does it correctly for simple videos but fails at cherry-picked edge cases.


For extra clarification, "published" means that at least a paper or some official announcement has to come out about it before that date.

Also, the fact that I haven't resolved yes doesn't necessarily mean that I think none of the stuff that is out yet counts. I'm likely going to wait until June in any case and resolve based on the best model available by then, unless something pretty clearly counts before that (in case someone was updating on how harsh my judgment is based on seeing that I haven't resolved despite whatever model coming out).

Tоm bought Ṁ100 of NO

Nothing released so far has come close to demonstrating object permanence

Butanium is predicting YES at 36%
Valery Cherepanov sold Ṁ68 of NO

Emad recently said (in a Raoul Pal video) that StabilityAI text-to-video will likely be released in May.

GPT-PBot bought Ṁ10 of YES

A machine that creates art,
With a single click to start,
But can it make videos that sing?
Only time will tell, ding-a-ling.

N.C. Young is predicting YES at 36%

@Mason Bit of a reach, don't you think?

Sky is predicting YES at 43%

What does published mean? Must it just be demonstrated that such a model existed (even with limited access) before June 2023?

Victor Levoso

If there's a closed beta and I get enough info from people trying the model to say the model gets object permanency mostly right, this resolves yes.
If for example OpenAI had such a model (which might already be the case) but they don't release it in any way until after that date, then it doesn't count and resolves yes.
If there's a paper but people can't try the model, then we are in unclear territory and

Victor Levoso

@VictorLevoso oops, clicked send before finishing the message:
If there's a paper/blog/announcement before that date and lots of disagreement about whether the tests on it show it counts or are just cherry-picked, and the market doesn't have any strong opinion either way, I'll either resolve N/A or use my own judgment to decide if it's clear that it does.

Victor Levoso

@VictorLevoso also, I obviously meant "resolves no" after:

"If for example OpenAI had such a model."

Ugh, kind of annoying not being able to edit

Hayden Jackson
Sky is predicting YES at 37%
Valery Cherepanov is predicting NO at 52%

@Sky I think the snippets in the video might qualify as non-crappy but they are probably cherry-picked and average performance will be significantly worse.

Erick Ball is predicting NO at 52%

@ValeryCherepanov they are also very short.

Old meme bought Ṁ25 of YES

@ErickBall Length isn't part of the stated resolution criteria, is it?

Erick Ball is predicting NO at 45%

@Oldmeme not explicitly, but if it's only a few frames then it's likely to be subjectively "crappy" and it will also be quite difficult to convincingly demonstrate object permanence.

Sky is predicting YES at 47%
Eduardo Filippi bought Ṁ50 of NO

If it didn't include the object permanence bit I'd say maybe, but the problem as described here would require a ridiculous amount of computing power to run (let alone train).

Henri Thunberg is predicting YES at 22%

@cherrvak Certainly crappy 😃 But there are enough people working on it for one of them to make significant progress before June 😌

Niclas Kupper is predicting YES at 25%

This seems like a strong contender: https://arxiv.org/abs/2302.07685

Sam is predicting NO at 25%

@NiclasKupper Looks like about where some of the image generation work was two or three years before it really took off.

Somewhat related market with slightly longer timeframe: https://manifold.markets/SamuelRichardson/will-you-be-able-to-use-ai-tools-to

Michael is predicting NO at 25%

@NiclasKupper Some example results here: https://sihyun.me/PVDM. (Though note that it does not produce these videos from text prompts)

Reality Quotient is predicting YES at 31%

Does this count? https://twitter.com/runwayml/status/1622594989384519682?s=46&t=xG3-unwphscZW3CuBHcpMQ

If not, what's missing from this example that would be needed?

Patrick Delaney is predicting NO at 27%

@RealityQuotient It's really good, but it's not arbitrary, it's video-to-video. Right?

Reality Quotient is predicting YES at 31%
Konstantine Sadov is predicting YES at 32%

@cherrvak wait this is video-to-video I screwed up =_=

Scott Lawrence sold Ṁ731 of NO

Closed out my position, primarily out of fear of the precise resolution criteria, which haven't been clarified. (See my comments below.)

Valery Cherepanov is predicting NO at 26%
Lars Doucet bought Ṁ50 of YES
Jon Simon is predicting NO at 27%

@LarsDoucet ngl still looks pretty crappy

Lars Doucet is predicting YES at 27%

@jonsimon Definitely got object permanence tho

Jon Simon is predicting NO at 27%

@LarsDoucet kind of, that piece of metal the robot is playing with doesn't really seem to have a consistent size

Tоm bought Ṁ181 of NO

@LarsDoucet Looks like it needs an input image or video, not just a text prompt. Also, it doesn't look like it handles object occlusions well

Edward Kmett

@jonsimon It isn't great yet, but they do seem to resolve the 'switching leg' problem by and large by enforcing a sort of spatial coherence.

Konstantine Sadov is predicting YES at 36%

"For one thing, in addition to ChatGPT and the outfit’s popular digital art generator, DALL-E, Altman confirmed that a video model is also coming, though he said that he “wouldn’t want to make a competent prediction about when,” adding that “it could be pretty soon; it’s a legitimate research project. It could take a while.”" https://techcrunch.com/2023/01/17/that-microsoft-deal-isnt-exclusive-video-is-coming-and-more-from-openai-ceo-sam-altman/

Jon Simon

How long do the generates video clips have to be?

Jon Simon

@jonsimon *generated

Konstantine Sadov is predicting YES at 60%
Tоm bought Ṁ30 of NO

@NikitaBrancatisano Seems like the videos are generated by creating continuations of short videos, rather than being generated by text prompt

Nikita Brancatisano is predicting YES at 66%

@TomShlomi You're right! I missed the text prompt bit 😔

Still, looks very promising

Konstantine Sadov is predicting YES at 50%

This is looking a little more impressive: https://yingqinghe.github.io/LVDM/

joy_void_joy is predicting YES at 68%

@ValeryCherepanov Weird, I'd have thought they'd prioritize this. I'm unclear how much I should trust this estimate and how much to update if they are working on it right now

Jaime Sevilla bought Ṁ200 of YES

Making of Spiderverse fan video.

Lots of manual editing, and not quite "text-to-video".

But quite promising, especially with regards to frame-to-frame visual consistency.

Jaime Sevilla is predicting YES at 76%

@JaimeSevilla I might have gotten too excited.

The frame-to-frame consistency in the actual video is done through optical flow reconstruction, not StabilityAI like in the first clip shown

Rina Razh bought Ṁ10 of YES

Still crappy but the dog eating the ice cream weirded me out for sure https://www.creativebloq.com/news/meta-ai-video-generator

Mikhail Samin is predicting YES at 80%

I just realized there’s “June 2023”, not “January 2023”

Ryu18 is predicting YES at 79%
Future Telling Owl is predicting NO at 79%

The recent progress is definitely more than I expected so soon, but the question asked for object permanency, and we haven't seen anything that looks like that, or even clips long enough to demonstrate it.

Gigacasting

Phenaki shows very strong grasp of object perspective, rotation, zoom, layering, and permanency.

One can debate whether it is “crappy” (vastly less so than Dalle-1/mini) or “reasonable quality” (maybe) but it conditions on all prior frames and clearly passed the permanence tests.

joy_void_joy is predicting YES at 67%

@Gigacasting Hard disagree. It is pretty clear from its footage that it's just relying on the prompt to infer what should be there (see how the teddy bear never leaves the water, but the scene just transitions in a blurred fashion)

Scott Lawrence is predicting NO at 67%

@JoyVoid This is why I was pushing for a pre-committed, specific test for object permanency. We're about to get bogged down into "what counts".

From an outside view, every debate about AI progress has the appearance of massive goalpost-moving (on both sides). Very frustrating.

Future Telling Owl is predicting NO at 71%

I think the example from the market description (mountain being hidden by a cloud, then appearing after) should be agreeable? We have most of a year for someone to publish a model and someone else to try that prompt.

For examples of missing object permanency, the shape of the treeline noticeably changes when the rider crosses them in the top-center and lower-right images here.

A full occlusion of an object would be ideal. With a partial occlusion, it's hard to tell the difference between whether the model remembered the object (has object permanency) versus the model managed to infer a good enough filling from the parts of the model it sees (is really good at inpainting).

Scott Lawrence bought Ṁ60 of NO

@FutureOwl I can't tell if I'm being pedantic and repetitive, or genuinely poking at an important issue. I apologize if it's the former.

As written in the market description, I think there's an important ambiguity. If the prompt is "a cloud passes in front of a mountain", then the test described, that "the mountain should still be there", is not in any way a test of object permanence. Of course the mountain is still there---the model can just read the prompt to see that there should be a mountain still there!

In short: it's very important that the thing that is permanent be something that isn't in the prompt, and is not inferable from the prompt+single previous frame.

Scott Lawrence is predicting NO at 67%

@ScottLawrence As an aside (after watching those videos more closely), I also consider it an object permanence fail if the number of legs the horse has keeps changing!

But that's just beating a dead five-legged horse.

Gigacasting sold Ṁ51 of NO

Mostly pedantic 🤔

The description asks for “most of the time”; it obviously passes that test in the videos and especially the longer video.

Picking the worst frame violates everything about the definitions (cherry-picking edge cases)

Not opposed to calling this "crappy", but it's missing the point entirely (or a massive difference in what the word "most" means 🤔) to focus on small objects or tree shape when he literally asked for "mountains" to not fully disappear "most of the time"

(No position as it’s very reasonable to go with either result.)

Future Telling Owl is predicting NO at 80%

@Gigacasting I actually picked those trees because they were the only ones I could find where the horse went far enough that there was a full occlusion.

Gigacasting sold Ṁ30 of NO

Revising: Phenaki with any post-processing achieves this.

While Deepmind has stayed away from the text-image hype race, and OpenAI has been closed off and uncreative, Stable Diffusion clearly sparked something, and people are finally being clever in architectural design and compute efficiency.

Even a moderate combo of Phenaki with the final layers from the Meta paper will produce this.

Gigacasting is predicting NO at 56%


Gigacasting is predicting NO at 56%

👑 anon paper

Vs. 🤮 meta hype

Victor Levoso

@Gigacasting Oh yeah, I also just saw that paper on Twitter, came here to see if someone had mentioned it.

Seems like ICLR is going to have lots of text-to-video stuff apparently.

Also, after skimming the Meta paper, while I think there's no clear distinction between gifs and videos, I do agree that only being able to basically specify the first frame of the video and have the video basically be interpolation from that is a big limitation, and arguably doesn't fit the spirit of "arbitrary video", maybe.

But this anon paper doesn't seem to have that problem from looking at the abstract and examples, so I don't think whether that counts will really matter for the resolution.

(Not that I'm going to resolve yet, but I'm thinking about what my criterion will be.)

Gigacasting bought Ṁ90 of NO

So crappy they immediately ghosted their own Twitter account:

Gigacasting is predicting NO at 56%

These are not only NOT "arbitrary videos", they aren't even videos at all.

They turn a prompt into an image, then do slight animation around it (similar to iPhone "Live" photos with a few bordering frames).

Very far from resolution: this is high-res gif animation and bears no resemblance to video

mkualquiera bought Ṁ180 of YES

gif animations are literally videos lololol

Jaime Sevilla bought Ṁ300 of YES
joy_void_joy bought Ṁ500 of YES

It's still pretty bad with respect to the question, but this is Dalle-1 level of crappy. Would be very surprised if we don't reach Stable Diffusion level by June 2023 (I am less sure about an open-source/toxiccandy model by then)

mkualquiera bought Ṁ200 of YES

@JoyVoid then why are you betting 500 on YES?

mkualquiera is predicting YES at 74%

Apologies, I read that wrong XD

Luis M bought Ṁ12 of NO

@JaimeSevilla Good demo, but still in crappy territory. Hoping to see a lot of progress in the near future. If you read the paper, you will see the limitations of this approach, mainly because we have GPUs with low memory. So this is not the path if we are hoping to get text-to-video with the current hardware.

Victor Levoso

@JaimeSevilla Yeah I saw that.

The demo looks nice but I'm not going to resolve yes yet.

Like, it's possible that their model is good enough already, but I'm fine with waiting a few months until we have better models and it's more obvious, especially if Stability releases an open version I can play around with or something.

If we somehow don't get any better video models before the close date I'll think hard about it and ask people whether they think it should resolve based on this demo but that seems very unlikely.

joy_void_joy is predicting YES at 54%

@VictorLevoso In my opinion, this should not resolve yes yet, with respect to the "non-crappy" part of your question

Victor Levoso

@JoyVoid I think I mostly agree, yeah, and it does feel more like dalle-1, and looking at the paper it seems it has some limitations around how prompts work.

It is interesting that it seems to not fail obviously at object permanency specifically, but we also only have 4 potentially cherry-picked 5-second videos, so.

Btw to be clear I don't require an open source version to be released to resolve yes, it will just make my job easier and the resolution less contentious.

Scott Lawrence bought Ṁ200 of NO

On the basis of the "object permanency" requirement, I'm betting NO. I do think it would be best if we could agree on a set of test cases, particularly in light of SA's recent claim to have effectively won a bet of this sort.

Can I suggest: a video of a crowd of people, with a car (or train) passing in front.

Lars Doucet is predicting YES at 30%

@ScottLawrence How much is the train or crowd of people allowed to change/shift? IE, if the train changes color or type of train, does it still count? If the number of people changes, or any of their faces/genders/clothing noticeably swaps, does it still count?

Scott Lawrence is predicting NO at 30%

@LarsDoucet In what I described the train is in front---visible the entire video. If a visible object changes color or shape, that's a hard fail, right?

Since the point is to test object permanence, if the number of people changes when occluded by the train, that's also a fail. Similarly if it's obvious that a person changed. (For instance, if before the train passes there's a tall guy in a colorful hat, and that guy can't be found afterwards, that would be a fail from my perspective.)

Of course, the resolution is up to the market creator, I'm just proposing a test which I think reasonably probes whether the model "gets" object permanence.

I'm fine, by the way, with needing an extra prompt like "and no people vanish while the train is obstructing the view". However, needing to plug in object permanence manually (by saying "three people are visible before, and three people are visible after") shouldn't count.

Weaker versions are possible. For instance, in a video of "bus passes in front of person", the person should have the same appearance before and after. Now there's much less information for the network to keep track of. Any network that fails that cannot reasonably be said to grasp object permanence, I think.
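The test above could be sketched as a toy scoring harness. This is purely illustrative: it assumes some external vision model has already produced object detections for the frames before and after the occlusion (the detector itself is out of scope), and `permanence_failures` plus the example detections are hypothetical names, not part of any real evaluation suite.

```python
# Toy sketch of the proposed object-permanence check: compare what a
# (hypothetical) detector saw before and after a full occlusion.
from collections import Counter

def permanence_failures(before, after):
    """Compare object detections before and after a full occlusion.

    `before` and `after` are lists of (label, attributes) tuples, e.g.
    ("person", "tall, colorful hat"). Returns the objects that vanished
    and the objects that appeared across the occlusion; under the test
    described above, either kind of change counts as a failure.
    """
    vanished = Counter(before) - Counter(after)  # in before, gone after
    appeared = Counter(after) - Counter(before)  # absent before, new after
    return list(vanished.elements()), list(appeared.elements())

# Example: three people before the train passes, two after.
before = [("person", "tall, colorful hat"),
          ("person", "red shirt"),
          ("person", "red shirt")]
after = [("person", "red shirt"),
         ("person", "red shirt")]
vanished, appeared = permanence_failures(before, after)
# The tall guy in the colorful hat can't be found afterwards: a fail.
```

The multiset comparison deliberately ignores ordering and allows duplicates, matching the "crowd of people" version of the test where several people may share an appearance.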

Lars Doucet is predicting YES at 34%

@ScottLawrence I like this test, just wanted to clarify. Because a lot of the video stuff I've been seeing tends to have a lot of shifting of these cases, but some of the newer models seem to be doing at least a bit better on that front. Nice and clear goal in any case.

Victor Levoso

@ScottLawrence Sounds like a good test to me.

Although I might still resolve yes if it can't do this but can do other similar things, or no if it can do the bus/train tasks but fails in similar tasks most of the time.

I think the "person still exists" level is enough, model has to show it can do basic object permanence, not be perfect at it, so let's say it's fine if people and objects shift a bit as long as they are recognizable as the same kind of person or object.

I'll be lenient with things shifting but not with things disappearing.

If it can do easy object permanency tasks with 1 object but fails with multiple objects, I will likely resolve yes anyway, as long as it's consistently good at the easy tasks.

Scott Lawrence is predicting NO at 34%

@VictorLevoso Sounds fair.

One of the guiding principles I have is that it's absolutely essential that the model be able to perform object permanence for objects not explicitly mentioned in the prompt. So for instance, if I say "a bus passing in front of a woman wearing a red shirt", then I'm not impressed if before and after, there's a woman wearing a red shirt, since the model doesn't really have to "remember" anything, but can rather just refer back to the prompt.

Whereas, if I say "a bus passing in front of a woman", and the appearance of the woman is nontrivially the same before and after, that actually indicates to me that the model has captured the essence of object permanence. (Even if the shoes look a bit different before and after.)

That's the point of the "crowd of people"---it forces the model to invent a decent amount of information, and then get that information right before and after.

(I'm writing all this up less to convince/browbeat y'all, and more to record for me-in-a-year that this is the thing I doubt will be accomplished. That way I know when to be surprised/impressed.)

(While my NO bet is of course unconditional, I'll be properly surprised if this is done without any "hack", like first asking a language model to generate a detailed description of a scene, and then feeding that description into a video model. In other words, if a model manages to learn object permanence on its own.)

Broke Sinclair


Account deletion requested bought Ṁ6 of NO

What if it's gated behind something that blocks open testing?

What if it is limited to 20-second videos?

Victor Levoso

@M For the first thing, I'll decide how to resolve based on the situation and how much info is available.
I'm unsure what to do in case of really short videos.
I actually expect that if we have good video models we'll have videos that are at least a bit longer (I mean, Transframer is already 30 seconds, even if it doesn't seem to be great quality), and that even if it's not open, Stability will release an open source version soon enough, so it's likely not going to be a problem.
I guess I'll also ask people how they think the market should be reasonably resolved if it's not obvious, and if it's still unclear it will just resolve N/A or 50%.

Alex Wilson

Would something that was "high-quality" but limited in scope like a digital art or animated only version count?

Victor Levoso

@AlexWilson Let's say no, I guess. It has to at least be able to simultaneously do realistic and some other format like animation.

joy_void_joy bought Ṁ500 of YES

https://nuwa-infinity.microsoft.com/ Seems like there's good progress on this
Gigacasting

Very difficult.
Gigacasting is predicting NO at 42%

Atari-quality cartoons maybe. But the sheer amount of compute per frame of Dalle2 is extraordinary, and video would require architectural improvements and one or two orders of magnitude more spend.
Victor Levoso

@Gigacasting I want to note that I disagree, and would buy some yes if I wasn't the owner of the market or it didn't depend on my own judgment. While it's possible that it takes 2 years instead of 1, I think we already mostly have the architecture necessary, and compute available plus willingness to spend is increasing, so I expect that either OpenAI or someone else is going to start working on it if they haven't already, especially with Nvidia's H100. Videos might be pretty short at first, though.
MP is predicting NO at 41%

@VictorLevoso You'll lose because it's simply too early. I don't think there will be enough GPUs to do that before June 2023. But as it seems, a video equivalent is inevitable. I don't think we even have the crappy version!
joy_void_joy

@MP Depends what you mean by crappy; an abstract version of it certainly [exists](https://www.youtube.com/watch?v=0fDJXmqdN-A), though I agree it is quite far from the realism of Dall-e
Victor Levoso

@MP We do seem to have some things that are kind of crappy video generation already (https://www.gwern.net/docs/ai/video/generation/index). And we went from iGPT, which was super crappy, to dall-e 2 in two years, and AI progress seems to be getting faster, not slower. And VPT seems like a lot of evidence of video coming soon to me: understanding Minecraft seems like it requires understanding video, and their method could probably generate Minecraft videos by just changing it a bit. Plus, yeah, video is a lot harder than images, but for now we are riding multiple exponentials, so I expect it to come unintuitively fast. That said, I wouldn't bet super high here, cause maybe it does take until 2024 instead, or like winter 2023, it's hard to tell, but I would bet very high if someone makes a similar market with a 2025 close date or something like that.
MP is predicting NO at 41%

@JoyVoid Interesting, wasn't aware of that!