Was synthetic video data generated and used in training Sora?
Dec 15

Was synthetic video data generated for training Sora, e.g. large-scale clips from Unreal Engine, as suggested by this tweet? It does not count if there merely happens to be synthetic data in the training data. This will resolve YES if the consensus is that large-scale synthetic data generation was used for Sora.

If there is no clear consensus, I will resolve it based on my judgement using any material released by OpenAI.

No mention of internally generated synthetic data 🤔

bought Ṁ1,000 YES

Suppose GPT-4 labels YouTube videos and Sora is trained on GPT-4 label -> video pairs. Would this market resolve YES?

The description specifically says:

Was synthetic data generated for training Sora, ie large scale clips from Unreal Engine as suggested by this tweet

We already knew from the OpenAI Sora post that auto-captioning was used, before the market was created:

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

I think it's obvious that the intent of this market didn't include auto-captioning, and was intended to capture if that tweet was accurate.

(If we want to go by the literal English meaning, "i.e." means "that is" or "specifically", so it specifies what the previous clause means, whereas if it said "e.g.", the video captions would potentially be one example of many.)

It would not resolve YES. The market is specifically about a significant amount of synthetic video data generated for training Sora. Synthetic captions or other metadata don't count.

These are big limit orders - do you have inside info, @jacksonpolack ?

no! I just like large limit orders.

@jacksonpolack Can't tell if serious.

So what happens if no new information is available about the training data by the close time?

@Shump Good question! I extended the close time until the end of the year. I will resolve this question early if new information is available before then and resolve as N/A otherwise.

I think it should plausibly stay open for longer; OpenAI isn't very open about how their models are trained.

opened a Ṁ10,000 NO at 62% order

people have already taken 1k of this order!

limit order up for 10k no!
