EG "make me a 120 minute Star Trek / Star Wars crossover". It should be more or less comparable to a big-budget studio film, although it doesn't have to pass a full Turing Test as long as it's pretty good. The AI doesn't have to be available to the public, as long as it's confirmed to exist.
A bit rambly, but Ray Kurzweil predicts AI movies within a year or two (minute 50).
@Joshua Having both play money with loans and sweepstakes makes Manifold the life of the prediction-markets party.
@jim to be fair, I’d argue that AGI is a necessary but insufficient condition to generate full-length Hollywood films from a single prompt.
@benshindel The opposite could be argued as well: Generating full-length Hollywood movies from a single prompt is a necessary but insufficient condition for AGI.
I tend more towards your argument, but I'd guess if this market would resolve yes, there'd still be no consensus (outside tech bubble) for "AGI exists".
@jim We won't have AGI in 2028.
But even if we did, it is true that we would not have full length AI generated movies.
@benshindel Correct. That is what I said. We will not have AGI in 2028, but even if we did, we would still not have the movies.
Same with the Turing Test market; even if we have AGI by 2029, it still won't pass the test.
"pretty good" in the resolution criteria is extremely vague and really should be improved given the popularity of this market.
The criterion I bet on in June is "whether I'd give the movie at least a 6/10 on IMDb (which would put it in the top 85% of the 1377 movies I've seen/rated, i.e. not horrible) if I wasn't rating it especially favorably due to being impressed by the fact that it was created by AI (a novelty factor that would wear off quickly after a few AI-generated movies)."
As someone actively trying to build a "text-to-movie" generator, I would like to talk about what I think the SOTA is right now, and where we are going.
Script Generation:
Currently:
Right now, there are only 2 models that are decent at writing: Gemini-1.5-pro and Claude 3.5 Sonnet. If you haven't used them, you have no idea what LLMs are capable of. Even so, they require a fair bit of prompt-engineering to make something that doesn't sound like a pile of cliches.
Although Gemini-1.5 does have a large context window (2M tokens), this does not mean you can simply prompt it "write a screenplay". There is a limit on how many tokens the model will output, but more importantly there is a limit on "effective attention". Past a few thousand words, the model will stop listening to the prompt and revert to its default "helpful assistant" mode (which, needless to say, is terrible for creative writing).
For now, the best way to get around this is to first prompt the LLM for a list of scenes and then generate those scenes one by one. This results in stories that mostly make sense, but occasionally have sudden jumps or plot holes.
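A minimal sketch of that two-pass approach (the prompts and the OpenAI client here are my own illustrative choices, not a fixed recipe; any of the models above would slot in via their own SDKs):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap for whichever model writes best for you
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Pass 1: get a scene list, so no single call needs more "effective attention"
# than one scene's worth of text.
outline = llm("Write a numbered list of 30 scenes for a feature-length "
              "screenplay about X. One line per scene.")
scenes = [line for line in outline.splitlines() if line.strip()]

# Pass 2: expand each scene independently, feeding the outline back in for continuity.
script = []
for scene in scenes:
    script.append(llm(f"Here is the full outline:\n{outline}\n\n"
                      f"Write the screenplay for this scene only:\n{scene}"))

print("\n\n".join(script))
```

The jumps and plot holes come from pass 2: each scene only "sees" the outline, not the actual text of the scenes already written.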
How far are we:
I feel like if you took Gemini-1.5-pro and fine-tuned it on a few thousand screenplays, you would get something that was "good enough". Certainly by 2028 I don't think script-writing will be the bottleneck.
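If someone did attempt that fine-tune, the data prep is the easy part; a sketch of turning a folder of screenplays into instruction-tuning pairs (the one-line-logline-per-file layout is my own convention for this sketch, and the JSONL messages shape is just the common format, so check your provider's docs):

```python
import json
from pathlib import Path

# Assumed layout: one plain-text screenplay per file, with a one-line
# logline as the first line (hypothetical convention for this sketch).
with open("screenplays.jsonl", "w") as out:
    for path in Path("screenplays/").glob("*.txt"):
        logline, _, body = path.read_text().partition("\n")
        out.write(json.dumps({"messages": [
            {"role": "user", "content": f"Write a screenplay: {logline}"},
            {"role": "assistant", "content": body},
        ]}) + "\n")
```

The hard part is that a full screenplay blows past the output limits of most fine-tuning APIs, so in practice you'd chunk each script the same scene-by-scene way as above.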
Video generation:
Currently:
The current crop of video models (Sora, Runway Gen-3, Kling, Seaweed) is just barely getting to the level of photorealism needed. If you just want a person walking or talking, we are basically there. For complex motions or interactions, however, the current models will only produce monstrously deformed obscenities.
If we had to go with today's video generators, any film would be mostly people walking through scenes without interacting with much, or sitting and talking.
How far are we:
Pretty far, honestly, but this seems like the type of problem that AI has been making a lot of progress on recently. I expect models to get 10x better/year with no real reason to hit a roadblock between now and 2028 (unless we max-out on photorealism). So basically if 1/100 shots are usable right now, it should be 1/10 next year and so on.
Character consistency:
Currently:
There are 2 ways to get consistent characters right now:
1. You can write a very detailed description of your character and hope they look the same from scene to scene (this works poorly).
2. You can use an image-to-video model and take advantage of the (excellent) tools we have for character consistency in image models (face-ip-adapter, LoRAs, etc.). A quick visit to civitai should convince you there is no problem creating photorealistic images of a specific person. (See the sketch after this list.)
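For route 2, the image half really is off the shelf; a minimal sketch using diffusers' IP-Adapter integration (the model IDs are the published SD 1.5 weights, and the adapter scale and file names are the commonly used defaults, so treat the specifics as assumptions):

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Condition generation on a reference face so the same character
# appears in every keyframe you later feed to an image-to-video model.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.8)  # higher = stronger identity lock, less prompt freedom

face = load_image("my_character.png")  # hypothetical reference image
frame = pipe(
    prompt="the same woman walking down a rainy neon street, cinematic still",
    ip_adapter_image=face,
    num_inference_steps=30,
).images[0]
frame.save("keyframe.png")
```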
How far are we:
I would call this a nearly solved problem (using route 2), but I would really like to see character consistency built into video models, ideally with something like OmniGen where you can just give it a series of images and say "give me a video where these two characters are ..."
Audio (speaking):
Current models (ElevenLabs) sound human, but don't give you a ton of control over things like emotion. I expect that with the release of OpenAI's new voice model, this will basically be a solved problem.
Audio (foley):
Currently:
There are a handful of models that you can give a video and get an audio track. They are good tech demos, but not good enough for production (unless you are willing to prompt multiple times until you get the right thing).
How far are we:
The current state of the art (generate the video, then generate audio for it) is a stupid approach. There is no reason a transformer can't generate audio and video simultaneously, which would be much better (toy sketch below). It will probably be at least a year or two before people train such a model, though.
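To make "simultaneously" concrete, here is a toy illustration in pure PyTorch; the setup is entirely my own assumption, not any shipped model: if video and audio are both quantized to discrete tokens (e.g. by VQ-VAEs), one decoder-only transformer can model the interleaved stream, so the soundtrack is predicted jointly with the pixels instead of bolted on afterwards.

```python
import torch

V_CODES, A_CODES = 1024, 512  # assumed VQ codebook sizes for video and audio
video = torch.randint(0, V_CODES, (16,))            # 16 video tokens for one clip
audio = torch.randint(0, A_CODES, (4,)) + V_CODES   # audio ids offset into a shared vocab

# Interleave 4 video tokens with 1 audio token, repeating; a standard
# autoregressive LM trained on this stream emits both modalities at once.
stream = torch.cat(
    [torch.cat([video[i * 4:(i + 1) * 4], audio[i:i + 1]]) for i in range(4)]
)
print(stream.shape)  # torch.Size([20])
```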
Audio (music):
Currently:
The SOTA music-gen models (Suno, Udio) are indistinguishable from professional music to me (not a trained musician).
How far are we:
I cannot imagine this being the bottleneck.
Direction:
Currently:
Even if all of the above problems are solved, you still need some way to combine them into a final product. That is, you need an AI that takes a screenplay, converts it into a sequence of shots, generates video/speech/foley/music for each shot, and produces a final output. Ideally, such an AI will be fine-tuned on human preferences so that it can produce multiple takes of each shot and select the best one. I have written scripts that do a first approximation of this, but it is not nearly good enough. And I haven't bothered adding video/audio/music, because creating even a 30-minute video would cost hundreds of dollars.
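My own scripts are roughly this shape; a stubbed-out sketch where every generate_* function is a placeholder for whatever model you'd actually call (Sora/Runway for video, ElevenLabs for speech, a preference model for scoring), and best-of-N selection is just a scored loop:

```python
from dataclasses import dataclass

@dataclass
class Shot:
    description: str  # e.g. "INT. BRIDGE - two officers argue over the viewscreen"

# Placeholder generators; each returns dummy bytes so the sketch runs end to end.
def screenplay_to_shots(screenplay: str) -> list[Shot]:
    return [Shot(line) for line in screenplay.splitlines() if line.strip()]

def generate_video(shot: Shot, seed: int) -> bytes:
    return f"video[{shot.description}|take {seed}]".encode()

def generate_speech(shot: Shot) -> bytes:
    return b"speech"

def score_take(video: bytes, shot: Shot) -> float:
    return float(len(video))  # stand-in for a learned preference model

def mux(video: bytes, audio: bytes) -> bytes:
    return video + audio  # stand-in for combining tracks with e.g. ffmpeg

def direct(screenplay: str, takes_per_shot: int = 3) -> list[bytes]:
    film = []
    for shot in screenplay_to_shots(screenplay):
        takes = [generate_video(shot, seed) for seed in range(takes_per_shot)]
        best = max(takes, key=lambda v: score_take(v, shot))  # keep the best take
        film.append(mux(best, generate_speech(shot)))
    return film

print(len(direct("Shot one.\nShot two.")))  # 2
```

The takes_per_shot parameter is exactly why cost matters so much: every extra take multiplies the video-generation bill.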
How far are we:
I don't think we will even begin to make progress on this part until the cost of the other parts comes down by 100x or so (so generating a full-length film costs a few dollars). I estimate costs are coming down at a rate of 10x/year, so in about 2 years (2026) we will see the first bad AI-generated movies.
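For what that projection implies concretely (the 2024 starting cost is my rough extrapolation from the "hundreds of dollars for 30 minutes" figure above):

```python
cost = 1000.0  # assumed 2024 cost in USD for a feature-length film (~$100s per 30 min)
for year in range(2024, 2029):
    print(year, f"${cost:,.2f}")
    cost /= 10  # the 10x/year cost decline assumed above
# 2026 lands around $10, i.e. the "few dollars" threshold
```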
End-to-end model:
In an ideal world, we would not use six separate models (screenplay, video, speech, foley, music, direction). Instead a single model would be trained end-to-end to produce movies. So far, no one has trained such a model. In order to do so, they would have to overcome the "effective attention" problem (that I mentioned above about writing screenplays). Will there be a breakthrough between now and 2028 that allows us to do this? Maybe? Certainly the effective attention of GPT-4 is better than GPT-3's, and we should expect this trend to continue. I think if effective attention is solved, we will basically have AGI, and I feel like 2028 is soonish for that to happen. My timelines are more like 2030-2035.
Key milestones to look for:
2025: the first human-edited video >30min that is actually watchable
2026: the first low-quality, fully-AI-generated movies
2027: AI-generated movies proliferate across the internet. You cannot scroll YouTube without encountering them (some with millions of views). Occasionally it takes you a minute or two to realize a movie is AI-generated.
2028: This market resolves positive (idk, certainly I am a buy at 43%, probably not a buy at 90% though).
If we miss these milestones it will be for one of the following reasons:
AI has reached a "cap" on how good it can be (extremely unlikely)
The cost of training ever-larger models means AI progress slows down (plausible, but not the most likely outcome)
Regulation makes ai generated video illegal or requires prohibitive licensing from rights-holders (possible, but I suspect rights-holders will work out some kind of deal)
I have underestimated how hard each of these milestones is to achieve (maybe, idk)
@LoganZoellner I don't share your optimism (in particular, video generation is and will remain expensive), but at least you're listing out most of the required parts, and being honest about the BS of the supposed million token context windows.
One piece you didn't comment on is that this needs to work across all styles. It's not just some random good movie that's required, but a movie to a prompt.
@VitorBosshard I don't anticipate style being a big problem. Again, a quick glance at civitai will show AI is very good at a number of different styles. The market is a bit ambiguous about how the model will be evaluated, but the example prompt "Star Trek / Star Wars crossover" also implies it won't be judged that strictly on style (since it's basically the most generic idea you could suggest). I am assuming a standard of "about as good as the typical crappy Hallmark movie".
In terms of price, there is at least a 10x price drop baked in (current models use diffusion, which can be made dramatically faster using a technique called latent consistency models; see the sketch below). There is another ~4x more or less guaranteed from Moore's law. I'm assuming a 10x/year price drop because that's what has been the case recently: only 2 years ago a photorealistic image took hours to generate, now we can get one in <1 second.
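That latent-consistency speedup isn't hypothetical for images; a sketch of the diffusers LCM-LoRA path, which cuts step counts from ~30 to ~4 (the model IDs are the published LCM-LoRA weights for SD 1.5; specific numbers are the documented defaults):

```python
import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and distilled LoRA: ~4 denoising steps
# instead of ~30, which is most of the "10x baked in" price drop.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    prompt="photorealistic portrait, film still",
    num_inference_steps=4,
    guidance_scale=1.0,  # LCM works with little or no classifier-free guidance
).images[0]
image.save("fast.png")
```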
@LoganZoellner The writing quality can be even better if you utilize a base model (Llama 405b base, or, for example, NovelAI's recent finetune of, I believe, Llama 70b specifically for writing), though this hampers the directability significantly. The chatbot styles are very obvious, even though I agree with you that Sonnet 3.5 and Gemini are at the top of the pile in writing quality for chatbots. Admittedly that matters less for a movie script than for a novel.
If there was a group specifically focused on this (rather than enthusiasts throwing together a hodgepodge of models), I expect they would have the funds to finetune a base model specifically for competent story arc generation, plot points, dialogue, etcetera.
@Aleph Llama 405b is much worse at creative writing than Claude 3.5 Sonnet/Gemini-pro, and base models don't give you enough control (since to create a movie-length script you need structured generation).
I do think it would be cool if there were more of a community effort to fine-tune a model specifically for creative writing. Most "role play" models are focused on a very specific type of writing. Fundamentally, though, I don't think the current generation of models is capable of writing a movie-length script (without structured prompting) due to inherent limitations in how Transformers work (they are highly repetitive and lose track of details quickly).
@LoganZoellner I'm quite surprised that you think they're better. Most writing I get from any chatbot has a noticeably distinctive style and tendencies of focus (Claude/Gemini not as bad as ChatGPT), while Llama 405b can, for example, imitate a specific writing style in ways I've never really gotten Claude to manage.
Base models not being instructable is a big problem. There are options, though, like using Claude to manage the overall logic of what should be generated scene-by-scene, with a finetuned Llama 405b/70b base filling in the dialogue and stylistic details, and Claude tweaking or reprompting.
"I do think it would be cool if there were more of a community effort to fine-tune a model specifically for creative writing. Most 'role play' models are focused on a very specific type of writing."
Yeah, I agree. Even many of the roleplay models have a distinctive way of writing, similar in spirit to ChatGPT/Claude's, and just aren't general enough.
NovelAI is nice in that they trained on a massive amount of typical fiction writing, but it's unfortunately not open-source nor likely to ever be. It doesn't even have a per-token API.