Was synthetic video data generated and used in training Sora?

Dec 15

36% chance

Was synthetic video data generated for training Sora, e.g. large-scale clips from Unreal Engine, as suggested by this tweet? It does not count if there merely happens to be synthetic data in the training data. This will resolve yes if the consensus is that large-scale synthetic data generation was used for Sora.

If there is no clear consensus, I will resolve it based on my judgement using any material released by OpenAI.

@traders extending close time until a clearer consensus is available.

i’d recommend placing a date after which it resolves N/A or 50% or something (unless there’s a clearer consensus by then) to not have it stay open forever

(alternatively, it could resolve NO unless indication otherwise by some date)

The Sora safety card lists data sources and seems to leave out anything which would cause a Yes resolution:
>>> Sora was trained on diverse datasets, including a mix of publicly available data, proprietary data accessed through partnerships, and custom datasets developed in-house. These consist of:

Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls.
Proprietary data from data partnerships. We form partnerships to access non-publicly available data. For example, we partnered with Shutterstock⁠ Pond5 on building and delivering AI-generated images. We also partner to commission and create datasets fit for our needs.
Human data: Feedback from AI trainers, red teamers, and employees.

I think until this is in the hands of the general public (or close to it) it will be difficult to resolve. It would need targeted testing, e.g. seeding with specific Unreal frames or footage, or generating scenes in which Unreal's tell-tale signs could be detected, and the answer isn’t likely to show definitively in the hand-picked videos OpenAI releases.

I would be long yes on this from what I’ve seen and read but since there is no long and the market resolves soon, I’m staying out.

is this ever resolving lol

bought Ṁ50 NO

my priors would be 90% yes but i spoke to a knowledgeable person that suggested no, so i put it at 30% yes

https://twitter.com/Carnage4Life/status/1768267008192164313

No mention of internally generated synthetic data 🤔

bought Ṁ1,000 YES

Suppose GPT-4 labels youtube videos and Sora is trained on GPT-4 label -> video. Would this market resolve Yes?

The description specifically says:

Was synthetic data generated for training Sora, ie large scale clips from Unreal Engine as suggested by this tweet

We already knew that autocaptioning was used from the openai sora post before the market was created

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

I think it's obvious that the intent of this market didn't include auto-captioning, and was intended to capture if that tweet was accurate.

(if we want to go by literal english meaning, "i.e." means "is" or "specifically" so it specifies what the previous clause means, while if it said "e.g." the video captions would potentially be one example of many)

It would not resolve yes. The market is specifically for a significant amount of synthetic video data generated for training Sora. Synthetic captions or other metadata don't count.

These are big limit orders - do you have inside info, @jacksonpolack ?

no! I just like large limit orders.

@jacksonpolack Can't tell if serious.

@EliezerYudkowsky it was serious! I was just guessing and like large limit orders. Seemed like the reasons to believe YES weren't too strong

So what happens if no new information is available about the training data by the close time?

@Shump Good question! I extended the close time until the end of the year. I will resolve this question early if new information is available before then and resolve as N/A otherwise.