OpenAI has just announced a new text-to-video model, Sora, which demonstrates unprecedented visual quality and object permanence.
This question asks: which company will first release an open-weights text-to-video model (or image-to-video model) with fidelity equal to or greater than Sora's?
In order to resolve this question positively, the model must be open-weights, meaning anyone can download the model weights (possibly after signing a disclaimer). It need not be open-source: for example, it could be licensed for research only or restricted from commercial use.
Notable existing open-weights video generation models include:
Stable Video Diffusion: Stability AI
Hotshot XL: Natural Synthetics Inc.
AnimateLCM: Shanghai AI Lab
I2VGen-XL: Alibaba
MagicAnimate: ByteDance
ModelScope Text-to-Video: ModelScope (Alibaba)
(New answers can be added to this question.)
Quality will be assessed by my personal judgement, unless OpenAI releases official scores (for example, video FID) for Sora's performance. In order to resolve positively, a model must at a minimum: produce videos of length >= 60 s; demonstrate object permanence; and, most of the time, generate humans and animals with the correct number of arms/legs/fingers.
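For context on what a video-FID-style comparison would involve: metrics in this family (FID, FVD) compute the Fréchet distance between two sets of feature vectors, each modeled as a multivariate Gaussian. The sketch below is illustrative only and assumes feature vectors have already been extracted by some video encoder (FVD conventionally uses an I3D network; that step is not shown here).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between two feature sets (rows = samples),
    each modeled as a multivariate Gaussian."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; numerical error
    # can introduce tiny imaginary components, which we discard.
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    # d^2 = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * sqrt(cov_a cov_b))
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical feature distributions give a distance near zero; the larger the value, the further the generated videos' feature statistics are from the reference set's.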
Black Forest Labs, the creator of Flux, says it is training a video model:
https://blackforestlabs.ai/up-next/