Definitions
Modalities: This market considers four key modalities: Image, Audio, Video, Text.
Any Input-Output Combination: The AI should be versatile enough to accept any mixture of these modalities as input and produce any mixture as output.
Examples of modality combinations:
The AI model can take single- or multi-modality inputs and generate single- or multi-modality outputs.
For example:
Input: Text + Image, Output: Video + Audio
Input: Audio + Image, Output: Text + Image
Input: Text, Output: Video + Audio
Input: Video + Audio, Output: Text
Single-to-single generation examples:
The model should also be able to map a single-modality input to a single-modality output, for example (see the sketch after this list for the full combination space):
Text -> Image
Audio -> Text
Image -> Video
Image -> Audio
Image -> Text
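To make "any mixture in, any mixture out" concrete, here is a purely illustrative Python sketch (not part of the resolution criteria) that enumerates every input/output mixture a literal reading of the definition would cover:

```python
from enum import Enum
from itertools import chain, combinations

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"

def nonempty_subsets(items):
    """All non-empty subsets (mixtures) of the given modalities."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

# Every (input mixture, output mixture) pair a literal reading would cover:
required_pairs = [
    (set(ins), set(outs))
    for ins in nonempty_subsets(Modality)
    for outs in nonempty_subsets(Modality)
]

# 2^4 - 1 = 15 non-empty mixtures per side, so 15 x 15 = 225 pairs in total,
# which includes the 16 single-to-single cases listed above.
print(len(required_pairs))  # 225
```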
Criteria for Market Close
OpenAI must officially announce that the model's capabilities meet these criteria.
A staggered or slow release of the model is acceptable (via the API or the UI).
OpenAI must allow at least some portion of the general public to access the model.
Market inspiration comes from rumors about "Arrakis" and academic work on Composable Diffusion (https://arxiv.org/pdf/2305.11846.pdf).
Video = sequence of images + sound?
i.e. is GPT-4 considered to consume video because it can consume multiple images as input?
Or are you thinking more like "video-specialized autoencoder that maps a time-indexed (image, sound) to the embedding space as a single object"?
@Mira No, GPT-4 isn't considered (under this definition) to consume video because it can handle multiple images as inputs. (Can it, though? I haven't checked.)
Of course it should have the temporal qualities of a video, but I won't comment on how it might be trained to ensure that.
@firstuserhere It's hard to tell since it's not public. Bing currently only keeps a single image in context, but I believe the reason is cost, not capability. AFAIK, images get mapped to the embedding space just like tokens, so I would think it does.
I would expect any official video support to use a better representation: they don't have enough GPUs to do it like that, and it would be hard to train. But you don't want someone arguing on a technicality that it supports video because it supports images + sound...
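Roughly the contrast I have in mind, as a purely illustrative sketch (toy shapes and hypothetical module names, nothing to do with OpenAI's or CoDI's actual code):

```python
import torch
import torch.nn as nn

# (a) "Technicality" route: each frame is embedded independently like an
#     ordinary image token; the model only ever sees a bag of images.
# (b) Dedicated video route: one encoder maps the whole time-indexed
#     (frames, audio) pair to a single object in the embedding space.

class FramewiseImageEncoder(nn.Module):
    """Route (a): per-frame embeddings, no single video object."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, dim)   # toy per-frame projection

    def forward(self, frames):                    # frames: (T, 3, 64, 64)
        return self.proj(frames.flatten(1))       # (T, dim): T separate embeddings

class VideoEncoder(nn.Module):
    """Route (b): the whole clip (frames + audio) becomes one embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(3 * 64 * 64, dim)
        self.audio_proj = nn.Linear(16000, dim)
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frames, audio):             # audio: (16000,) for a 1 s clip
        f = self.frame_proj(frames.flatten(1)).unsqueeze(0)       # (1, T, dim)
        _, h = self.temporal(f)                                   # summarize over time
        return h.squeeze(0).squeeze(0) + self.audio_proj(audio)   # (dim,): one object

frames = torch.randn(8, 3, 64, 64)   # 8 frames
audio = torch.randn(16000)
print(FramewiseImageEncoder()(frames).shape)   # torch.Size([8, 512])
print(VideoEncoder()(frames, audio).shape)     # torch.Size([512])
```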
@Mira Yep, not going to count that or other similar technicalities.
As for the training, from the CoDI paper:
Specifically, we start by independently training image, video, audio, and text LDMs.
These diffusion models then efficiently learn to attend across modalities for joint multimodal generation (Section 3.4) by a novel mechanism named “latent alignment”.
and finally:
The final step is to enable cross-attention between diffusion flows in joint generation, i.e., generating two or more modalities simultaneously. This is achieved by adding cross-modal attention sublayers to the UNet.
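For anyone curious what those cross-modal attention sublayers might look like, here is a minimal sketch of the general idea (my own simplification with toy dimensions, not CoDI's actual code): the latents of one modality's diffusion flow attend to the latents of another during joint generation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Toy cross-attention sublayer: latents of modality A attend to modality B.
    The paper describes inserting this kind of sublayer into the UNet so the
    diffusion flows can condition on each other when generating two or more
    modalities at once; the shapes and dimensions here are illustrative only."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents_a, latents_b):
        # latents_a: (batch, tokens_a, dim)  e.g. flattened image UNet features
        # latents_b: (batch, tokens_b, dim)  e.g. audio UNet features
        q = self.norm(latents_a)
        out, _ = self.attn(query=q, key=latents_b, value=latents_b)
        return latents_a + out   # residual: without cross-attention, flow is unchanged

# Joint image + audio generation step (toy shapes):
img_latents = torch.randn(1, 64 * 64, 320)    # image UNet activations
aud_latents = torch.randn(1, 256, 320)        # audio UNet activations
layer = CrossModalAttention()
print(layer(img_latents, aud_latents).shape)  # torch.Size([1, 4096, 320])
```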