Will OpenAI announce a multi-modal AI capable of any input-output modality combination by end of 2025? ($1000M subsidy)


  • Modalities: This market considers four key modalities: Image, Audio, Video, Text.

  • Any Input-Output Combination: The AI should be versatile enough to accept any mixture of these modalities as input and produce any mixture as output.

Combination of modalities examples:

The AI model can take single- or multiple-modality inputs and generate single- or multiple-modality outputs.

For example:

  • Input: Text + Image, Output: Video + Audio

  • Input: Audio + Image, Output: Text + Image

  • Input: Text, Output: Video + Audio

  • Input: Video + Audio, Output: Text

Single-to-single generation examples:

The model should also be able to map a single-modality input to a single-modality output, such as:

  • Text -> Image

  • Audio -> Text

  • Image -> Video

  • Image -> Audio

  • Image -> Text
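Taken together, the lists above imply a finite space of required capabilities: every non-empty mixture of the four modalities as input, paired with every non-empty mixture as output. A minimal sketch that enumerates that space (the modality names mirror the market description; the code itself is purely illustrative):

```python
from itertools import chain, combinations

MODALITIES = ["Text", "Image", "Audio", "Video"]

def nonempty_subsets(items):
    """All non-empty subsets (mixtures) of the given modalities."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    ))

# Every (input mixture, output mixture) pair the market would require.
pairs = [(i, o) for i in nonempty_subsets(MODALITIES)
                for o in nonempty_subsets(MODALITIES)]
print(len(pairs))  # 15 non-empty input subsets x 15 output subsets = 225
```

All of the bullet examples above (e.g. Text + Image -> Video + Audio, or Image -> Text) are individual elements of these 225 pairs.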

Criteria for Market Close

  • OpenAI must officially announce that the model meets these capability criteria.

  • A staggered or gradual release of the model is acceptable (via the API or a UI).

  • OpenAI must allow at least some portion of the general public to access the model.

Market inspiration comes from rumors about "Arrakis" and academic work on Composable Diffusion (https://arxiv.org/pdf/2305.11846.pdf).


What's surprised me is that everyone has bought YES! I've not seen many one-sided markets in the AI category.

If they announce the model but nobody can use it (or only Microsoft or big-money business partners can), does it count for YES? Or does it have to be available to the public, i.e. anyone in the US could sign up and test it?

@Mira edited the description to clarify this. Thanks!

Video = sequence of image + sound?

i.e. is GPT-4 considered to consume video because it can consume multiple images as input?

Or are you thinking more like "video-specialized autoencoder that maps a time-indexed (image, sound) to the embedding space as a single object"?

@Mira No, GPT-4 isn't considered (under this definition) to consume video merely because it can handle multiple images as input. (Can it, though? I haven't checked.)

Of course it should capture the temporal qualities of a video, but I won't comment on how it might be trained to ensure that.

@firstuserhere It's hard to tell since it's not public. Bing currently only keeps a single image in context, but I believe the reason is cost, not capability. AFAIK, images get mapped to the embedding space just like tokens, so I would think it does.

I would expect any official video support to use a better representation: They don't have enough GPUs to do it like that, and it would be hard to train. But you don't want someone arguing on a technicality that it supports video because it supports images + sound...

@Mira Yep, not going to count that or other similar technicalities.

As for the training, from the CoDI paper:

Specifically, we start by independently training image, video, audio, and text LDMs.

These diffusion models then efficiently learn to attend across modalities for joint multimodal generation (Section 3.4) by a novel mechanism named “latent alignment”.

and finally:

The final step is to enable cross-attention between diffusion flows in joint generation, i.e., generating two or more modalities simultaneously. This is achieved by adding cross-modal attention sublayers to the UNet.
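The cross-modal attention the paper describes can be sketched very roughly: latents from one modality's diffusion flow form the queries, and latents from another modality form the keys and values. This is a single-head toy version with random weights; the names, shapes, and dimensions are all illustrative assumptions, not the CoDI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(x_a, x_b, d):
    """Latents of modality A attend to latents of modality B (one head).

    Weights are random placeholders; a real sublayer would learn them.
    """
    w_q = rng.normal(size=(x_a.shape[-1], d))
    w_k = rng.normal(size=(x_b.shape[-1], d))
    w_v = rng.normal(size=(x_b.shape[-1], d))
    q, k, v = x_a @ w_q, x_b @ w_k, x_b @ w_v
    attn = softmax(q @ k.T / np.sqrt(d))  # each A-latent weights all B-latents
    return attn @ v

img_latents = rng.normal(size=(16, 32))  # hypothetical image-flow latents
aud_latents = rng.normal(size=(8, 24))   # hypothetical audio-flow latents
out = cross_modal_attention(img_latents, aud_latents, d=32)
print(out.shape)  # (16, 32)
```

The point of adding such sublayers inside the UNet is that each modality's denoising step can condition on the other modalities' intermediate latents, which is what lets the flows generate jointly rather than independently.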
