Will OpenAI announce that they will stop using freely crawled data for training large language models, by April 9 2024?

resolved Apr 11

Resolved

Kyunghyun Cho predicts: "OpenAI will announce within 6 months that they are from there on exclusively using sourced (and often paid for data) and stop using freely crawled data for their large language models."

https://twitter.com/kchonyc/status/1711182191751729238

Resolution please @Rodeo

bought Ṁ50,000 NO

resolves no (in 3 hours)

What exactly separates "freely crawled" data from "sourced" data?

predictedNO

https://www.cnn.com/2023/12/27/tech/new-york-times-sues-openai-microsoft/index.html

predictedYES

I think my market on any to any modality is somewhat related to this if this happens:

predictedYES

Adding a small Ṁ100 subsidy to begin with

predictedYES

Similar:

@jacksonpolack I'd make the NO bet too as a first reaction, but now I'm thinking about the possible scenarios that would lead to such an announcement, and I'm not so sure.

@firstuserhere Off the top of my head:

OpenAI will announce within 6 months that they are from there on exclusively using sourced (and often paid for data) and stop using freely crawled data for their large language models.

A shift away from Large Language Models to other modalities, where there is a lot of scope for human-generated data.
Lots of collection done prior to the announcement
  1. This statement doesn't say they will delete or stop using the data they’ve already freely crawled up to that point.
Push for legislation backed by OpenAI
  1. If the statement is taken at face value, it uses the phrase "exclusively using sourced" which could mean that they might engage in "partnerships" from there on out
  2. OpenAI creates a spinoff company called OpenDataCollectionAgency or something and then strikes an exclusive partnership with them while keeping hands clean.
Sourced data
  1. Stop crawling websites and get partnerships to get data streams of the best sources of human generated data, and do the rest of language data in-house?
Data transforms
  1. Creating new datasets
  2. Focusing on transfer learning a lot more
  3. opencrowdsourcingAI for data gathering
  4. The definition of a "large language model" changes enough to let this be true
  5. An online learning system, or continuous learning system, etc.

(just random thoughts, could be totally off).

This statement doesn't say they will delete or stop using the data they’ve already freely crawled up to that point.

I (strongly) interpret the market to say they will stop using the data they've crawled until this point for future models, not that they'll stop crawling. Data you freely crawled in 2019 is still "freely crawled data", and "stopping using it" means you're not using it. Not using future crawled data is very low cost for them. Hopefully the creator can clarify that.

Other modalities: ehh, I think medium to long-length text is the best source overall.

OpenAI creates a spinoff company called OpenDataCollectionAgency or something and then strikes an exclusive partnership with them while keeping hands clean.

If they do that and NotOpenAI crawls new data without making agreements with the sites it's crawling and sells it to OpenAI, I'd hope the market would still resolve NO; that's just silly otherwise.

Stop crawling websites and get partnerships to get data streams of the best sources of human generated data, and do the rest of language data in-house?

They 100% could do that, though. Just pay all the big social media sites, pay book publishers, pay journals, etc.

And future court decisions or laws are also an issue.

Another important thing is there are a lot of public domain texts that OpenAI uses. Probably they "freely crawled" those. And they're not going to stop using them. I think the market we're interested in excludes those from consideration, but the current one doesn't obviously do so.

predictedYES

@jacksonpolack

I (strongly) interpret the market to say they will stop using the data they've crawled until this point for future models, not that they'll stop crawling.

I hope so too! My bet here so far is just throwing mana away on a what-if, haha.

"stopping using it" means you're not using it.

If a GPT-5-like model that has memorized a lot of the data is used to re-create something, is that the same as using the original data?

If they do that and NotOpenAI crawls new data without making agreements with the sites it's crawling and sells it to OpenAI, I'd hope the market would still resolve NO; that's just silly otherwise.

Agreed, which is why I'm making the point in the first place.

Another important thing is there are a lot of public domain texts that OpenAI uses. Probably they "freely crawled" those. And they're not going to stop using them.

There are so many parts of the statement that might happen while others don't. The market is unclear. Request to @Rodeo to clarify what is required for a resolution. If OpenAI claims it will not collect free data in the future, is that going to count? Do they have to say they will stop using past collected data?