Will GPT-4 do video summarization?

bought Ṁ150 of NO

GPT-4 takes images and text as input, not video. This market should resolve NO, though you can wait for the API to be released to confirm that only images will be accepted as input.

bought Ṁ26 of YES

@Mira How would it even have a score on TVQA if it couldn’t take video as input? Or LSMDC? This statement doesn’t really make sense in light of the released info.

The difference is basically semantic anyway: videos are just series of frames, and frames are images. The paper directly demonstrates that the model can take multiple images at once as input (see the InstructGPT example).

This is true whether or not the first version of the public-facing API includes video (or multi-image) input.
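To make the frames-as-images point concrete: feeding a clip to a multi-image model just means subsampling a handful of evenly spaced stills. A minimal sketch (the function name and the frame cap are illustrative, not anything from the GPT-4 API):

```python
def sample_frame_times(duration_s: float, max_frames: int) -> list[float]:
    """Pick evenly spaced timestamps (in seconds) at which to grab stills.

    A video is just a sequence of frames; if a multi-image model only
    accepts a handful of images, we subsample uniformly across the clip.
    """
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    # Centre each sample inside its interval so we avoid the clip edges.
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]

# e.g. summarize an 8-second clip from 4 stills:
print(sample_frame_times(8.0, 4))  # [1.0, 3.0, 5.0, 7.0]
```

Each timestamp would then be decoded to an image (e.g. with a video library) and sent as one of the interlaced image inputs.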

bought Ṁ200 of NO

@DanStoyell The technical report states: "GPT-4 accepts prompts consisting of both images and text, which—parallel to the text-only setting—lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images."

They probably tested it on stills from clips with captions. The technical report doesn't even mention the word "video" anywhere.

The difference between multi-image and video is whether the model has something like a positional embedding, but for time. Maybe if you send a bunch of images it'll have an index, but it won't have a timestamp, and it'll probably cap out at some small number of images, so you can't do anything more than a couple of stills.
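For illustration, the kind of time encoding being described might look like a sinusoidal embedding of each frame's timestamp, by analogy with the positional embeddings transformers use for token order. This is purely a hypothetical sketch, since GPT-4's architecture is not public:

```python
import math

def time_embedding(t_seconds: float, dim: int = 8) -> list[float]:
    """Sinusoidal embedding of a frame's timestamp.

    Analogous to transformer positional embeddings, but keyed on wall-clock
    time rather than sequence index, so frames carry 'when' not just 'which'.
    """
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(t_seconds * freq))
        emb.append(math.cos(t_seconds * freq))
    return emb
```

A model with only a frame index would get the same embedding for frames sampled one second apart and one hour apart; a time embedding like this one would distinguish them.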

predicts YES

@Mira I'm not sure I understand where our object-level disagreement is. We agree that GPT-4 can take multiple images as input, and that videos are composed of multiple images, and that it is SOTA on a video QA dataset and near-SOTA on a clip description dataset. If a model can ingest a series of images that when played at the right rate form a video, and can then accurately answer questions and describe the video, that is what "video summarization" is, no?

The TVQA challenge offers a timestamp-annotated version and one without timestamps, and historically different models have used different approaches to encode time for video. It also has 460 hours of video, so achieving a high score on this challenge would require not capping out at a "couple stills".

I am not betting above 50% partially because I am happy to keep buying as you dip this market <20%, but also because I don't know whether @AlexKChen will require audio, which would be a reasonable (if non-standard) interpretation of video.

predicts YES

@Mira One way to resolve this to both our satisfaction would probably just be to wait for a bit. As the paper mentions:

However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle. We plan to release further analyses and evaluation numbers as well as thorough investigation of the effect of test-time techniques soon.

If these further analyses do not allow video summarization after a year or so (or market close although that would be a long time), then resolving NO seems reasonable to me.

bought Ṁ50 of YES

GPT-4 already beats SOTA on TVQA, a "large-scale video QA dataset based on 6 popular TV shows". It is competitive on LSMDC, a "large scale movie description" benchmark on clips.

The release post also says "However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle."

Why is this market so low? If GPT-4 is already SOTA at video QA and clip summarization, it seems extremely likely that it will be able to do video summarization, even if not perfectly (which this market doesn't require).

Arguably this market should resolve to YES based on LSMDC alone, since clips are technically videos?