How many tokens does Sora use to encode one second of high-resolution video (1920*1080)? (February version) | Manifold



2028

5,342 expected


Resolves when the figure becomes known.

If Sora does not use tokens, this resolves N/A. That seems highly unlikely, since OpenAI has repeatedly stated that Sora uses a diffusion transformer.

Only the latent diffusion model counts. If Transformers are also used for the VAE compression, that part is ignored.

For reference:

The original ViT splits an image into 16×16-pixel patches, so a 256×256-pixel image yields 16 × 16 = 256 tokens. This architecture does not use a VAE to compress the image into a latent space.
Gemini 1.5 Pro uses about 300 tokens per second of video.
LLaVA-UHD uses up to 5k tokens for 4k resolution images.
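
To make the ViT arithmetic above concrete, here is a minimal sketch of how a token count follows from resolution, patch size, and any latent compression. The helper `patch_tokens` and all of its parameters are hypothetical names chosen for illustration. The second call uses made-up settings (24 fps, 8x spatial and 4x temporal VAE downsampling, 2×2 latent patches); none of these are known Sora values, and the result is not a prediction.

```python
import math


def patch_tokens(height, width, patch, frames=1, temporal_patch=1,
                 spatial_down=1, temporal_down=1):
    """Count tokens for a patchified image or video clip.

    spatial_down / temporal_down model an optional VAE that shrinks the
    input before patchification (1 means no compression, as in ViT).
    Partial patches are rounded up, i.e. the input is assumed to be padded.
    """
    latent_h = math.ceil(height / spatial_down)
    latent_w = math.ceil(width / spatial_down)
    latent_t = math.ceil(frames / temporal_down)
    return (math.ceil(latent_h / patch)
            * math.ceil(latent_w / patch)
            * math.ceil(latent_t / temporal_patch))


# ViT-style reference: 256x256 pixels, 16x16-pixel patches, no VAE.
vit_tokens = patch_tokens(256, 256, patch=16)
print(vit_tokens)  # 256

# Purely hypothetical settings for one second of 1920x1080 video.
# These numbers are assumptions for illustration, not Sora's configuration.
hypothetical = patch_tokens(1080, 1920, patch=2, frames=24,
                            spatial_down=8, temporal_down=4)
print(hypothetical)  # 48960
```

The count is very sensitive to the assumed compression factors: doubling the spatial downsampling or the latent patch size cuts the result roughly fourfold, which is why the answer cannot be pinned down without the actual architecture details.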

AI Video Generation


People are also trading

Was synthetic video data generated and used in training Sora?

23% chance (-5% over 1 day)

How many seconds will Sora take to generate 10 seconds of video?

Does Sora use DPO?

