Definitions
Modalities: This market considers four key modalities: Image, Audio, Video, and Text.
Any Input-Output Combination: The AI should be versatile enough to accept any mixture of these modalities as input and produce any mixture as output.
Examples of modality combinations:
The AI model can take single- or multi-modality inputs and generate single- or multi-modality outputs.
For example:
Input: Text + Image, Output: Video + Audio
Input: Audio + Image, Output: Text + Image
Input: Text, Output: Video + Audio
Input: Video + Audio, Output: Text
Single-to-single generation examples:
The model should also be able to map a single-modality input to a single-modality output, such as:
Text -> Image
Audio -> Text
Image -> Video
Image -> Audio
Image -> Text
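To make "any mixture of modalities" concrete, the sketch below enumerates the full input/output space implied by the definitions above: every non-empty subset of {Text, Image, Audio, Video} as input, paired with every non-empty subset as output. This is illustrative only; the market does not require the model to support every pair, and the function names here are hypothetical.

```python
from itertools import chain, combinations

# The four modalities named in the market definition.
MODALITIES = ("Text", "Image", "Audio", "Video")

def nonempty_subsets(items):
    """All non-empty subsets (mixtures), e.g. (Text,), (Text, Image), ..."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    ))

# Every input mixture paired with every output mixture.
mixtures = nonempty_subsets(MODALITIES)
io_pairs = [(inp, out) for inp in mixtures for out in mixtures]

print(len(mixtures))  # 2**4 - 1 = 15 possible mixtures
print(len(io_pairs))  # 15 * 15 = 225 input/output combinations
```

The single-to-single examples in the list above are just the 16 pairs where both the input and output mixtures contain exactly one modality.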
Criteria for Market Close
OpenAI must officially announce that the model's capabilities meet these criteria.
A staggered or slow release of the model is acceptable (via the API or a user interface).
OpenAI must allow at least some portion of the general public to access the model.
Market inspiration comes from rumors about "Arrakis" and academic work on Composable Diffusion (https://arxiv.org/pdf/2305.11846.pdf).