Definitions
Modalities: This market considers four key modalities: Image, Audio, Video, Text.
Any Input-Output Combination: The AI should be versatile enough to accept any mixture of these modalities as input and produce any mixture as output.
Examples of modality combinations:
The AI model can take single- or multi-modality inputs and generate single- or multi-modality outputs.
For example:
Input: Text + Image, Output: Video + Audio
Input: Audio + Image, Output: Text + Image
Input: Text, Output: Video + Audio
Input: Video + Audio, Output: Text
Single-to-single generation examples:
The model should also be able to map a single-modality input to a single-modality output, for example (see the sketch after this list for the full combination space):
Text -> Image
Audio -> Text
Image -> Video
Image -> Audio
Image -> Text
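To make "any mixture in, any mixture out" concrete, here is a purely illustrative Python sketch (not part of the resolution criteria) that enumerates every input/output mixture a literal reading of the definition would cover:

```python
from enum import Enum
from itertools import chain, combinations

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"

def nonempty_subsets(items):
    """All non-empty subsets (mixtures) of the given modalities."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

# Every (input mixture, output mixture) pair a literal reading would cover:
required_pairs = [
    (set(ins), set(outs))
    for ins in nonempty_subsets(Modality)
    for outs in nonempty_subsets(Modality)
]

# 2^4 - 1 = 15 non-empty mixtures per side, so 15 x 15 = 225 pairs in total,
# which includes the 16 single-to-single cases listed above.
print(len(required_pairs))  # 225
```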
Criteria for Market Close
OpenAI must officially announce that the model's capabilities meet these criteria.
A staggered or slow release of the model is acceptable (via the API or the UI).
OpenAI must allow at least some portion of the general public to access the model.
Market inspiration comes from rumors about "Arrakis" and academic work on Composable Diffusion (https://arxiv.org/pdf/2305.11846.pdf).
Video = sequence of images + sound?
i.e. is GPT-4 considered to consume video because it can consume multiple images as input?
Or are you thinking more like "video-specialized autoencoder that maps a time-indexed (image, sound) to the embedding space as a single object"?
@Mira No, GPT-4 isn't considered (under this definition) to consume video because it can handle multiple images as inputs. (Can it, though? I haven't checked.)
Of course it should have the temporal qualities of a video, but I won't comment on how it might be trained to ensure that.
@firstuserhere It's hard to tell since it's not public. Bing currently only keeps a single image in context, but I believe the reason is cost, not capability. AFAIK, images get mapped to the embedding space just like tokens, so I would think it does.
I would expect any official video support to use a better representation: they don't have enough GPUs to do it like that, and it would be hard to train. But you don't want someone arguing on a technicality that it supports video because it supports images + sound...
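Roughly the contrast I have in mind, as a purely illustrative sketch (toy shapes and hypothetical module names, nothing to do with OpenAI's or CoDI's actual code):

```python
import torch
import torch.nn as nn

# (a) "Technicality" route: each frame is embedded independently like an
#     ordinary image token; the model only ever sees a bag of images.
# (b) Dedicated video route: one encoder maps the whole time-indexed
#     (frames, audio) pair to a single object in the embedding space.

class FramewiseImageEncoder(nn.Module):
    """Route (a): per-frame embeddings, no single video object."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, dim)   # toy per-frame projection

    def forward(self, frames):                    # frames: (T, 3, 64, 64)
        return self.proj(frames.flatten(1))       # (T, dim): T separate embeddings

class VideoEncoder(nn.Module):
    """Route (b): the whole clip (frames + audio) becomes one embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(3 * 64 * 64, dim)
        self.audio_proj = nn.Linear(16000, dim)
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, frames, audio):             # audio: (16000,) for a 1 s clip
        f = self.frame_proj(frames.flatten(1)).unsqueeze(0)       # (1, T, dim)
        _, h = self.temporal(f)                                   # summarize over time
        return h.squeeze(0).squeeze(0) + self.audio_proj(audio)   # (dim,): one object

frames = torch.randn(8, 3, 64, 64)   # 8 frames
audio = torch.randn(16000)
print(FramewiseImageEncoder()(frames).shape)   # torch.Size([8, 512])
print(VideoEncoder()(frames, audio).shape)     # torch.Size([512])
```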
@Mira Yep, not going to count that or other similar technicalities.
As for the training, from the CoDI paper:
Specifically, we start by independently training image, video, audio, and text LDMs.
These diffusion models then efficiently learn to attend across modalities for joint multimodal generation (Section 3.4) by a novel mechanism named “latent alignment”.
and finally:
The final step is to enable cross-attention between diffusion flows in joint generation, i.e., generating two or more modalities simultaneously. This is achieved by adding cross-modal attention sublayers to the UNet.
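For anyone curious what those cross-modal attention sublayers might look like, here is a minimal sketch of the general idea (my own simplification with toy dimensions, not CoDI's actual code): the latents of one modality's diffusion flow attend to the latents of another during joint generation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Toy cross-attention sublayer: latents of modality A attend to modality B.
    The paper describes inserting this kind of sublayer into the UNet so the
    diffusion flows can condition on each other when generating two or more
    modalities at once; the shapes and dimensions here are illustrative only."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latents_a, latents_b):
        # latents_a: (batch, tokens_a, dim)  e.g. flattened image UNet features
        # latents_b: (batch, tokens_b, dim)  e.g. audio UNet features
        q = self.norm(latents_a)
        out, _ = self.attn(query=q, key=latents_b, value=latents_b)
        return latents_a + out   # residual: without cross-attention, flow is unchanged

# Joint image + audio generation step (toy shapes):
img_latents = torch.randn(1, 64 * 64, 320)    # image UNet activations
aud_latents = torch.randn(1, 256, 320)        # audio UNet activations
layer = CrossModalAttention()
print(layer(img_latents, aud_latents).shape)  # torch.Size([1, 4096, 320])
```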