Minimum video length of 2 minutes, and it must maintain coherence. "Sound" means dialogue and background noise.
The visuals, any dialogue, and the sound must all be of "reasonable" quality: it does not need to be indistinguishable from human-made video, but there shouldn't be significant artifacts.
🏅 Top traders
# | Name | Total profit |
---|---|---|
1 | | Ṁ224 |
2 | | Ṁ190 |
3 | | Ṁ55 |
4 | | Ṁ41 |
5 | | Ṁ26 |
I think it's unlikely that we won't have the video part, since we already have low-quality video of the required length and it looks like it's just a question of scaling things up.
For audio, it should be possible to either build a multimodal model that does both somehow, train a video-to-audio model on YouTube, generate the audio separately with a text-to-audio model, or generate it separately with a text-to-voice model plus a video-to-background-sound model, or something like that.
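The "generate audio separately, then combine" option above can be sketched as a simple pipeline. This is a minimal illustration, not a real API: `text2video` and `text2audio` are hypothetical stand-ins for whatever models end up existing, and the "mux" step just checks that the two tracks cover the same duration before pairing them.

```python
# Hypothetical two-stage pipeline: video and audio generated separately, then muxed.
# text2video / text2audio are stubs standing in for real generative models.

def text2video(prompt, seconds, fps=24):
    # stub: a real model would return pixel frames; here, just frame indices
    return list(range(seconds * fps))

def text2audio(prompt, seconds, sample_rate=16000):
    # stub: a real model would return a waveform; here, a silent track
    return [0.0] * (seconds * sample_rate)

def mux(frames, audio, fps=24, sample_rate=16000):
    # verify both tracks cover the same duration before combining them
    video_s = len(frames) / fps
    audio_s = len(audio) / sample_rate
    assert abs(video_s - audio_s) < 1e-6, "track durations must match"
    return {"frames": frames, "audio": audio, "duration": video_s}

# 120 seconds = the 2-minute minimum from the resolution criteria
clip = mux(text2video("a dog on a beach", 120),
           text2audio("waves, distant barking", 120))
print(clip["duration"])  # 120.0
```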
@VictorLevoso What is your analysis 11 months later as to where things are at? Particularly with the audio integration
@JoshuaHedlund So things have been much slower in terms of video generation than I expected (and other things too), probably partly due to a lack of GPUs slowing things down (the bottleneck seems to be Nvidia rather than money).
I still think we might see good enough video towards the end of the year.
The problem is audio: we definitely have good voice generation by now, but I haven't seen any background-noise generation, and it seems less likely for a company to do both before the end of the year, as opposed to just doing high-quality video and figuring out audio later.
I think the main problem rn is probably that most people don't have enough compute to train models on video.
That said, we still have some conference deadlines left, and last year a lot of the video generation stuff came out at the end of the year, so there's still time for that.