Will a non-crappy video equivalent of dall-e be published before June 2023?
61%
chance
Based on this tweet https://twitter.com/ArthurB/status/1528991584309624832. Question resolves positive if a model is capable of generating arbitrary videos of reasonable quality from text prompts and demonstrates "object permanency" in the sense that it can resolve full object occlusions correctly (for example, a mountain being temporarily hidden by a low cloud should still be there after the cloud moves) most of the time. If it's unclear whether some existing model has the capabilities by the deadline, I'll use my judgment to decide how to resolve the market, and will lean towards yes in cases where the model mostly does it correctly for simple videos but fails at cherry-picked edge cases.
Tom Shlomi
bought Ṁ30 of NO

@NikitaBrancatisano Seems like the videos are generated by creating continuations of short videos, rather than being generated by text prompt

Nikita Brancatisano
is predicting YES at 66%

@TomShlomi You're right! I missed the text prompt bit 😔

Still, looks very promising

Konstantine Sadov
is predicting YES at 50%

this is looking a little more impressive: https://yingqinghe.github.io/LVDM/

joy_void_joy
is predicting YES at 68%

@ValeryCherepanov Weird, I'd have thought they'd prioritize this. I'm unclear how much I should trust this estimate, and how much to update if they are working on it right now.

Jaime Sevilla
bought Ṁ200 of YES

Making of Spiderverse fan video.

Lots of manual editing, and not quite "text-to-video".

But quite promising, especially with regards to frame-to-frame visual consistency.
https://www.youtube.com/watch?v=QBWVHCYZ_Zs&t=1s

Jaime Sevilla
is predicting YES at 76%

@JaimeSevilla I might have gotten too excited.

The frame-to-frame consistency in the actual video is done through optical-flow reconstruction, not StableAI like in the first clip shown.

Rina Razh
bought Ṁ10 of YES

Still crappy but the dog eating the ice cream weirded me out for sure https://www.creativebloq.com/news/meta-ai-video-generator

Mikhail Samin
is predicting YES at 80%

I just realized there’s “June 2023”, not “January 2023”

Ryu18
is predicting YES at 79%
Future Telling Owl
is predicting NO at 79%

The recent progress is definitely more than I expected so soon, but the question asked for object permanency, and we haven't seen anything that looks like that, or even clips long enough to demonstrate it.

Gigacasting

Phenaki shows very strong grasp of object perspective, rotation, zoom, layering, and permanency.

One can debate whether it is “crappy” (vastly less so than Dalle-1/mini) or “reasonable quality” (maybe) but it conditions on all prior frames and clearly passed the permanence tests.

joy_void_joy
is predicting YES at 67%

@Gigacasting Hard disagree. It is pretty clear from its footage that it's just relying on the prompt to infer what should be there (see how the teddy bear never leaves the water, but the scene just transitions in a blurred fashion).

Scott Lawrence
is predicting NO at 67%

@JoyVoid This is why I was pushing for a pre-committed, specific test for object permanency. We're about to get bogged down into "what counts".

From an outside view, every debate about AI progress has the appearance of massive goalpost-moving (on both sides). Very frustrating.

Future Telling Owl
is predicting NO at 71%

I think the example from the market description (mountain being hidden by a cloud, then appearing after) should be agreeable? We have most of a year for someone to publish a model and someone else to try that prompt.

For examples of missing object permanency, the shape of the treeline noticeably changes when the rider crosses it in the top-center and lower-right images here.

A full occlusion of an object would be ideal. With a partial occlusion, it's hard to tell the difference between whether the model remembered the object (has object permanency) versus the model managed to infer a good enough filling from the parts of the object it sees (is really good at inpainting).

Scott Lawrence
bought Ṁ60 of NO

@FutureOwl I can't tell if I'm being pedantic and repetitive, or genuinely poking at an important issue. I apologize if it's the former.

As written in the market description, I think there's an important ambiguity. If the prompt is "a cloud passes in front of a mountain", then the test described, that "the mountain should still be there", is not in any way a test of object permanence. Of course the mountain is still there---the model can just read the prompt to see that there should be a mountain still there!

In short: it's very important that the thing that is permanent be something that isn't in the prompt, and is not inferable from the prompt+single previous frame.

Scott Lawrence
is predicting NO at 67%

@ScottLawrence As an aside (after watching those videos more closely), I also consider it an object permanence fail if the number of legs the horse has keeps changing!

But that's just beating a dead five-legged horse.

Gigacasting
sold Ṁ51 of NO

Mostly pedantic 🤔

The description asks for “most of the time”; it obviously passes that test in the videos and especially the longer video.

Picking the worst frame violates everything about the definitions (cherry picking edge cases)

Not opposed to calling this “crappy” but it’s missing the point entirely (or a massive difference in what the word “most” means 🤔) to focus on small objects or tree shape when he literally asked for “mountains” to not fully disappear “most of the time”

(No position as it’s very reasonable to go with either result.)

Future Telling Owl
is predicting NO at 80%

@Gigacasting I actually picked those trees because they were the only ones I could find where the horse went far enough that there was a full occlusion.

Gigacasting
sold Ṁ30 of NO

Revising: Phenaki with any post-processing achieves this.

While Deepmind has stayed away from the text-image hype race, and OpenAI has been closed off and uncreative, Stable Diffusion clearly sparked something, and people are finally being clever in architectural design and compute efficiency.

Even a moderate combo of Phenaki with the final layers from the Meta paper will produce this.

Gigacasting
is predicting NO at 56%

Better:

Gigacasting
is predicting NO at 56%

👑 anon paper

Vs. 🤮 meta hype

VictorLevoso

@Gigacasting Oh yeah I also just saw that paper on twitter, came here to see if someone had mentioned it.

Seems like ICLR is going to have lots of text2video stuff, apparently.

Also, after skimming the Meta paper: while I think there's no clear distinction between GIFs and videos, I do agree that only being able to basically specify the first frame of the video and have the video basically be interpolation from that is a big limitation, and arguably doesn't fit the spirit of "arbitrary video", maybe.

But this anon paper doesn't seem to have that problem, from looking at the abstract and examples, so I don't think whether that counts will really matter for the resolution.

(Not that I'm going to resolve yet, but I'm thinking about what my criterion will be.)

Gigacasting
bought Ṁ90 of NO

So crappy they immediately ghosted their own Twitter account:

Gigacasting
is predicting NO at 56%

These are not only NOT “arbitrary videos”, they aren’t even videos at all.

They turn a prompt into an image, then do slight animation around it (similar to iPhone “Live” photos with a few bordering frames)

Very far from resolution—this is high-res gif animation and bears no resemblance to video

mkualquiera
bought Ṁ180 of YES

gif animations are literally videos lololol

Jaime Sevilla
bought Ṁ300 of YES
joy_void_joy
bought Ṁ500 of YES

It's still pretty bad with respect to the question, but this is Dalle1 level of crappy. Would be very surprised if we don't reach stablediffusion level by June 2023 (I am less sure about an open-source/toxiccandy model by then)

mkualquiera
bought Ṁ200 of YES

@JoyVoid then why are you betting 500 on YES?

mkualquiera
is predicting YES at 74%

Apologies, I read that wrong XD

Luis M
bought Ṁ12 of NO

@JaimeSevilla Good demo, but still in crappy territory. Hoping to see a lot of progress in the near future. If you read the paper, you will see the limitations of this approach, mainly because we have GPUs with low memory. So this is not the path if we are hoping to get text-video with the current hardware.

VictorLevoso

@JaimeSevilla Yeah I saw that.

The demo looks nice but I'm not going to resolve yes yet.

Like, it's possible that their model is good enough already, but I'm fine with waiting a few months until we have better models and it's more obvious, especially if Stability releases an open version I can play around with or something.

If we somehow don't get any better video models before the close date I'll think hard about it and ask people whether they think it should resolve based on this demo but that seems very unlikely.

joy_void_joy
is predicting YES at 54%

@VictorLevoso In my opinion, this should not resolve yes yet, with respect to the "non-crappy" part of your question

VictorLevoso

@JoyVoid I think I mostly agree, yeah, and it does feel more like dalle-1, and looking at the paper it seems it has some limitations around how prompts work.

It is interesting that it seems to not fail obviously at object permanency specifically, but we also only have 4 potentially cherry-picked 5-second videos, so.

Btw to be clear I don't require an open source version to be released to resolve yes, it will just make my job easier and the resolution less contentious.

Scott Lawrence
bought Ṁ200 of NO

On the basis of the "object permanency" requirement, I'm betting NO. I do think it would be best if we could agree on a set of test cases, particularly in light of SA's recent claim to have effectively won a bet of this sort.

Can I suggest: a video of a crowd of people, with a car (or train) passing in front.

Lars Doucet
is predicting YES at 30%

@ScottLawrence How much is the train or crowd of people allowed to change/shift? IE, if the train changes color or type of train, does it still count? If the number of people changes, or any of their faces/genders/clothing noticeably swaps, does it still count?

Scott Lawrence
is predicting NO at 30%

@LarsDoucet In what I described the train is in front---visible the entire video. If a visible object changes color or shape, that's a hard fail, right?

Since the point is to test object permanence, if the number of people changes when occluded by the train, that's also a fail. Similarly if it's obvious that a person changed. (For instance, if before the train passes there's a tall guy in a colorful hat, and that guy can't be found afterwards, that would be a fail from my perspective.)

Of course, the resolution is up to the market creator, I'm just proposing a test which I think reasonably probes whether the model "gets" object permanence.

I'm fine, by the way, with needing an extra prompt like "and no people vanish while the train is obstructing the view". However, needing to plug in object permanence manually (by saying "three people are visible before, and three people are visible after") shouldn't count.

Weaker versions are possible. For instance, in a video of "bus passes in front of person", the person should have the same appearance before and after. Now there's much less information for the network to keep track of. Any network that fails that cannot reasonably be said to grasp object permanence, I think.
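One crude way to operationalize the "same appearance before and after" check, as a minimal sketch: assume you've already cropped the same object's bounding box from a pre-occlusion and a post-occlusion frame (the cropping step, the pixel format, and the 0.1 threshold are all my assumptions, not anything from the thread), then compare color histograms:

```python
def color_histogram(pixels, bins=16):
    """Normalized color histogram over (r, g, b) pixel tuples:
    one 16-bin sub-histogram per channel, concatenated."""
    counts = [0.0] * (bins * 3)
    for r, g, b in pixels:
        for ch, v in enumerate((r, g, b)):
            counts[ch * bins + min(v * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def appearance_consistent(pixels_before, pixels_after, threshold=0.1):
    """Rough object-permanence check: the same object, cropped before and
    after a full occlusion, should have similar color distributions.
    L1 distance between normalized histograms lies in [0, 2]."""
    hb = color_histogram(pixels_before)
    ha = color_histogram(pixels_after)
    return sum(abs(a - b) for a, b in zip(hb, ha)) < threshold

# Toy usage: an unchanged red patch passes; a red-to-blue swap fails.
red_patch = [(200, 0, 0)] * 100
blue_patch = [(0, 0, 200)] * 100
print(appearance_consistent(red_patch, red_patch))   # True
print(appearance_consistent(red_patch, blue_patch))  # False
```

Of course this only catches gross changes like a color swap or a vanished person; it wouldn't notice the "tall guy in a colorful hat" being replaced by a different person of similar coloring, so it's a floor for the test, not the test itself.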

Lars Doucet
is predicting YES at 34%

@ScottLawrence I like this test, just wanted to clarify. Because a lot of the video stuff I've been seeing tends to have a lot of shifting of these cases, but some of the newer models seem to be doing at least a bit better on that front. Nice and clear goal in any case.

VictorLevoso

@ScottLawrence Sounds like a good test to me.

Although I might still resolve yes if it can't do this but can do other similar things, or no if it can do the bus/train tasks but fails in similar tasks most of the time.

I think the "person still exists" level is enough; the model has to show it can do basic object permanence, not be perfect at it. So let's say it's fine if people and objects shift a bit, as long as they are recognizable as the same kind of person or object.

I'll be lenient with things shifting, but not with things disappearing.

If it can do easy object permanency tasks with 1 object but fails with multiple objects, I will likely resolve yes anyway, as long as it's consistently good at the easy tasks.

Scott Lawrence
is predicting NO at 34%

@VictorLevoso Sounds fair.

One of the guiding principles I have is that it's absolutely essential that the model be able to perform object permanence for objects not explicitly mentioned in the prompt. So for instance, if I say "a bus passing in front of a woman wearing a red shirt", then I'm not impressed if before and after, there's a woman wearing a red shirt, since the model doesn't really have to "remember" anything, but can rather just refer back to the prompt.

Whereas, if I say "a bus passing in front of a woman", and the appearance of the woman is nontrivially the same before and after, that actually indicates to me that the model has captured the essence of object permanence. (Even if the shoes look a bit different before and after.)

That's the point of the "crowd of people"---it forces the model to invent a decent amount of information, and then get that information right before and after.

(I'm writing all this up less to convince/browbeat y'all, and more to record for me-in-a-year that this is the thing I doubt will be accomplished. That way I know when to be surprised/impressed.)

(While my NO bet is of course unconditional, I'll be properly surprised if this is done without any "hack", like first asking a language model to generate a detailed description of a scene, and then feeding that description into a video model. In other words, if a model manages to learn object permanence on its own.)

clarity

Related:

Account deletion requested
bought Ṁ6 of NO

What if it's gated behind something that blocks open testing?

What if it is limited to 20 second long videos?

VictorLevoso

@M For the first thing, I'll decide how to resolve based on the situation and how much info is available.
I'm unsure what to do in the case of really short videos.
I actually expect that if we have good video models, we'll have videos that are at least a bit longer (I mean, Transframer is already 30 seconds, even if it doesn't seem to be great quality), and that even if it's not open, Stability will release an open-source version soon enough, so it's likely not going to be a problem.
I guess I'll also ask people how they think the market should be reasonably resolved if it's not obvious, and if it's still unclear it will just resolve N/A or 50%.

AlexWilson

Would something that was "high-quality" but limited in scope like a digital art or animated only version count?

VictorLevoso

@AlexWilson Let's say no, I guess. It has to at least be able to simultaneously do realistic and some other format like animation.

joy_void_joy
bought Ṁ500 of YES
https://nuwa-infinity.microsoft.com/ Seems like there's good progress on this
Gigacasting
Very difficult.
Gigacasting
is predicting NO at 42%
Atari-quality cartoons maybe. But the sheer amount of compute per frame of Dalle2 is extraordinary, and video would require architectural improvements and one or two orders of magnitude more spend.
VictorLevoso
@Gigacasting I want to note that I disagree, and would buy some yes if I wasn't the owner of the market or it didn't depend on my own judgment. I think that while it's possible it takes 2 years instead of 1, we already mostly have the architecture necessary, and compute available plus willingness to spend is increasing, so I expect that either OpenAI or someone else is going to start working on it if they haven't started already, especially with Nvidia's H100. Videos might be pretty short at first though.
MP
is predicting NO at 41%
@VictorLevoso You'll lose because it's simply too early. I don't think there will be enough GPUs to do that before June 2023. But it seems a video equivalent is inevitable. I don't think we even have the crappy version!
JoyVoid
@MP Depends what you mean by crappy; an abstract version of it certainly [exists](https://www.youtube.com/watch?v=0fDJXmqdN-A), though I agree it is quite far away from the realism of Dall-e.
VictorLevoso
@MP We do seem to have some things that are kind of crappy video generation already (https://www.gwern.net/docs/ai/video/generation/index). And we went from iGPT, which was super crappy, to dall-e 2 in two years, and AI progress seems to be getting faster, not slower. And VPT seems like a lot of evidence of video coming soon to me: understanding Minecraft seems like it requires understanding video, and their method could probably generate Minecraft videos by just changing it a bit. Plus, yeah, video is a lot harder than images, but for now we are riding multiple exponentials, so I expect it to come unintuitively fast. That said, I wouldn't bet super high here, cause maybe it does take until 2024 instead, or like winter 2023; it's hard to tell. But I would bet very high if someone makes a similar market with a 2025 close date or something like that.
MP
is predicting NO at 41%
@JoyVoid Interesting, wasn't aware of that!