Will gpt4 do video summarization?


Mira bought Ṁ150 of NO

GPT-4 takes images and text as input, not video. This market should resolve NO, though you can wait for the API to be released to confirm that only images will be accepted as input.

4 replies
Dan bought Ṁ26 of YES

@Mira How would it even have a score on TVQA if it couldn’t take video as input? Or LSMDC? This statement doesn’t really make sense in light of the released info.

The difference is basically semantic anyway: videos are just series of frames, which are images. The paper directly demonstrates that it can take multiple images at once as input (see the InstructGPT example).

This is true whether or not the first version of the public-facing API includes video (or multi-image) input.

Mira bought Ṁ200 of NO

@DanStoyell The technical report states: "GPT-4 accepts prompts consisting of both images and text, which—parallel to the text-only setting—lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images."

They probably tested it on stills from clips with captions. The technical report doesn't even mention the word "video" anywhere.

The difference between multi-image and video is whether the model has something like a positional embedding, but for time. Maybe if you send a bunch of clips it'll have an index, but it won't have a time, and it'll probably cap out at some small number of clips so you can't do anything more than a couple stills.
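To make the index-versus-time distinction concrete, here is a minimal sketch of turning a video into a capped set of stills. It is not how GPT-4 or any benchmark harness handles video; the names `SampledFrame`, `sample_frames`, and `max_images` are hypothetical and exist only for illustration. Each sampled still keeps both its position in the prompt (the index) and the timestamp it came from, which is the extra information a plain multi-image input would drop.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SampledFrame:
    index: int          # position among the sampled stills (all a multi-image prompt keeps)
    source_frame: int   # frame number in the original video
    timestamp_s: float  # time of that frame in seconds (lost unless passed along explicitly)


def sample_frames(total_frames: int, fps: float, max_images: int) -> List[SampledFrame]:
    """Pick at most `max_images` evenly spaced stills from a video.

    `max_images` stands in for a hypothetical cap on images per prompt.
    """
    if total_frames <= 0 or fps <= 0 or max_images <= 0:
        return []
    count = min(total_frames, max_images)
    frames = []
    for i in range(count):
        # Midpoint of the i-th of `count` equal segments of the video.
        source = (2 * i + 1) * total_frames // (2 * count)
        frames.append(SampledFrame(index=i, source_frame=source, timestamp_s=source / fps))
    return frames


if __name__ == "__main__":
    # A 10-second clip at 30 fps, reduced to 3 stills.
    for frame in sample_frames(total_frames=300, fps=30.0, max_images=3):
        print(frame)
```

With a small cap, a long video collapses to a handful of widely spaced stills, which is the limitation described above; whether the timestamps reach the model depends on whether they are included alongside the images, for example as interleaved text.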

Dan predicts YES

@Mira I'm not sure I understand where our object-level disagreement is. We agree that GPT-4 can take multiple images as input, and that videos are composed of multiple images, and that it is SOTA on a video QA dataset and near-SOTA on a clip description dataset. If a model can ingest a series of images that when played at the right rate form a video, and can then accurately answer questions and describe the video, that is what "video summarization" is, no?

The TVQA challenge offers a timestamp-annotated version and one without timestamps, and historically different models have used different approaches to encode time for video. It also has 460 hours of video, so achieving a high score on this challenge would require not capping out at a "couple stills".

I am not betting above 50% partly because I am happy to keep buying as you push this market below 20%, but also because I don't know whether @AlexKChen will require audio, which would be a reasonable (if non-standard) interpretation of video.

Dan predicts YES

@Mira One way to resolve this to both our satisfaction would probably just be to wait for a bit. As the paper mentions:

However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle. We plan to release further analyses and evaluation numbers as well as thorough investigation of the effect of test-time techniques soon.

If these further analyses do not allow video summarization after a year or so (or by market close, although that would be a long time), then resolving NO seems reasonable to me.

Dan bought Ṁ50 of YES

GPT-4 already beats SOTA on TVQA, a "large-scale video QA dataset based on 6 popular TV shows". It is competitive on LSMDC, which is "large scale movie description" on clips.

The release post also says "However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle."

Why is this market so low? It seems extremely likely that GPT-4 will be able to do video summarization, even if not perfectly (which this market doesn't require), if it is already SOTA at video QA and competitive at clip description.

Arguably this market should resolve to YES based on LSMDC alone, since clips are technically videos?