This question resolves to YES if OpenAI's GPT-4 is trained on at least 2 distinct data modalities, such as images and text. Otherwise, it resolves to NO.
If GPT-4 is not released before 2024, this question will resolve to N/A.
Clarification: This question will resolve to YES if any single model consistently called "GPT-4" by OpenAI staff is multimodal. For example, if there are two single-modality models, one trained only on images and one trained only on text, that will not count.
Clarification 2: This question will resolve on the basis of all of the models that are revealed to have GPT-4 in their name within 24 hours of the first official announcement from OpenAI of GPT-4.
«We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while worse than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks»
I think a slightly more interesting question is whether the first non-OpenAI users of GPT-4 will be able to produce text from image or audio, or produce image or audio from text. We often don't know many training details, and the modalities of the training data could be ambiguous or unintuitive, e.g. if the model is trained on a lot of transcribed audio but no waveforms are ever fed into GPT-4 itself.
I wouldn't rephrase this market or make a new one, but I just want to flag this.
https://twitter.com/_SilkeHahn/status/1634230731265196039
Update on the Heise story from the journalist responsible.
She states on Twitter that Microsoft Germany's Chief Technologist (not CTO) contacted her with a request to correct his name in the article, but asked for no other changes. She interprets this as essentially confirming the article title's claim that GPT-4 is multimodal.
https://github.com/microsoft/visual-chatgpt
Not quite sure how this would resolve. Bought YES to get towards 50%…
@MaxPayne This is not GPT-4, and this is not even a single model. That's multiple models connected together via LangChain.
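For context, Visual ChatGPT's pattern is roughly: a text-only LLM orchestrates separate single-modality vision models as tools. Here is a minimal sketch of that pattern using early-2023 LangChain, where the model choice and the caption_image helper are my own illustrative assumptions, not Visual ChatGPT's actual code:

```python
# Minimal sketch of the "multiple models chained via LangChain" pattern
# (not Visual ChatGPT's actual code; the tool below is illustrative).
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI


def caption_image(image_path: str) -> str:
    """Hypothetical wrapper around a separate image-captioning model (e.g. BLIP)."""
    return "a placeholder caption"  # a real tool would run the vision model here


tools = [
    Tool(
        name="Image Captioning",
        func=caption_image,
        description="Describes the contents of an image, given its file path.",
    ),
]

# The LLM itself only ever sees text; images are handled by the separate tool
# models, which is why a setup like this wouldn't count as a single multimodal model.
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
agent.run("What is in the image at cat.png?")
```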
My main guess about what's going on:
The CTO mentioned that GPT-4 would be announced next week and then separately talked about all the AI services that are or will be available.
He was just emphasizing how AI APIs will allow for increasingly multimodal applications, e.g. Whisper enables multimodality because it does speech-to-text well. He might have claimed that text-to-video models are in the works or are about to be announced, as Sam Altman has previously mentioned, but he also might have just been talking about how video can be transcribed with Whisper. The embeddings mentioned are maybe just the latest Ada embeddings.
The focus on multimodality at the same time as the comment about GPT-4 was at some point incorrectly interpreted by a journalist as a claim that GPT-4 would be a multimodal model.
@NoaNabeshima Small input as a German speaker (assuming you, or whoever is reading, are not one):
The original quote is:
"Wir werden nächste Woche GPT-4 vorstellen, da haben wir multimodale Modelle, die noch ganz andere Möglichkeiten bieten werden – zum Beispiel Videos"
Heise's translation:
"We will introduce GPT-4 next week, there we will have multimodal models that will offer completely different possibilities – for example videos"
That translation is very literal, but also very accurate. The German quote is equally imprecise.
My interpretation is this:
"We will introduce GPT-4 next week. (At some point) we will have miltimodel models that will be capable of handling videos. We may already have multimodal models (Remember, this is Microsoft saying this, not OpenAI, so could be refering to Kosmos-1), but future ones will be more capable."
Wish there was a recording of the stream referenced in the article to confirm this. But yeah, agree with the conclusion that video is unlikely. Torn on whether GPT-4 is multimodal or not.
@NoaNabeshima Except in the title, sorry. Maybe I should say that the CTO doesn't claim GPT-4 will be multimodal in a direct quote.
@MatthewBarnett can you clarify this? Seems to plausibly make tens of % of difference in expectation.
@JacyAnthis I will add to the description that this question will resolve to YES if any single model consistently called "GPT-4" by OpenAI staff is multimodal. If there are two single-modality models, one trained only on images and one trained only on text, that will not count.
@MatthewBarnett Do they have to be released at the same time? For example, if this question were about ChatGPT, and they called Visual ChatGPT just ChatGPT, would this resolve negatively? I'm guessing so, or this question can never really resolve.
@BionicD0LPH1N I will resolve based on the first announcement of GPT-4, and on the basis of all the models announced or revealed to have GPT-4 in their name within 24 hours of that first official announcement. I will update the description accordingly.