Will OpenAI announce that they will stop using freely crawled data for training large language models, by April 9 2024?
resolved Apr 11
Resolved
NO

Kyunghyun Cho predicts: "OpenAI will announce within 6 months that they are from there on exclusively using sourced (and often paid for data) and stop using freely crawled data for their large language models."

https://twitter.com/kchonyc/status/1711182191751729238


🏅 Top traders

#  Name  Total profit
1  Ṁ367
2  Ṁ308
3  Ṁ67
4  Ṁ60
5  Ṁ39

Resolution please @Rodeo

bought Ṁ50,000 NO

resolves no (in 3 hours)

What exactly separates "freely crawled" data from "sourced" data?

predicted YES

I think my market on any to any modality is somewhat related to this if this happens:

predicted YES

Adding a small Ṁ100 subsidy to begin with

predicted YES

Similar:

sold Ṁ10 of NO

@jacksonpolack I'd make the NO bet too as a first reaction, but now I'm thinking about the possible scenarios that would lead to such an announcement, and I'm not so sure.

bought Ṁ100 of YES

@firstuserhere Off the top of my head:

OpenAI will announce within 6 months that they are from there on exclusively using sourced (and often paid for data) and stop using freely crawled data for their large language models.

  1. A shift away from Large Language Models to other modalities, where there is a lot of scope for human-generated data.

  2. Lots of collection done prior to the announcement

    1. This statement doesn't say they will delete or stop using the data they’ve already freely crawled up to that point.

  3. Push for legislation backed by OpenAI

    1. If the statement is taken at face value, it uses the phrase "exclusively using sourced" which could mean that they might engage in "partnerships" from there on out

    2. OpenAI creates a spinoff company called OpenDataCollectionAgency or something and then strikes an exclusive partnership with it while keeping its hands clean.

  4. Sourced data

    1. Stop crawling websites and get partnerships to get data streams of the best sources of human generated data, and do the rest of language data in-house?

  5. Data transforms

    1. Creating new datasets

    2. Focusing on transfer learning a lot more

    3. opencrowdsourcingAI for data gathering

    4. The definition of a "large language model" changes enough to let this be true

    5. An online learning system, or continuous learning system... etc etc

    6. etc

(just random thoughts, could be totally off).

bought Ṁ83 of NO
  1. This statement doesn't say they will delete or stop using the data they’ve already freely crawled up to that point.

I (strongly) interpret the market to say they will stop using the data they've crawled until this point for future models, not that they'll stop crawling. Data you freely crawled in 2019 is still "freely crawled data", and "stopping using it" means you're not using it. Not using future crawled data is very low cost for them. Hopefully the creator can clarify that.

Other modalities: ehh, I think medium to long-length text is the best source overall.

OpenAI creates a spinoff company called OpenDataCollectionAgency or something and then strikes an exclusive partnership with it while keeping its hands clean.

If they do that, and NotOpenAI crawls new data without making agreements with the sites it's crawling and then sells it to OpenAI, I'd hope the market would still resolve NO; that's just silly otherwise.

Stop crawling websites and get partnerships to get data streams of the best sources of human generated data, and do the rest of language data in-house?

They 100% could do that, though. Just pay all the big social media sites, pay book publishers, pay journals, etc.

And future court decisions or laws are also an issue.

Another important thing is that there are a lot of public domain texts that OpenAI uses. They probably "freely crawled" those, and they're not going to stop using them. I think the market we're interested in excludes those from consideration, but the current one doesn't obviously do so.

predicted YES

@jacksonpolack

I (strongly) interpret the market to say they will stop using the data they've crawled until this point for future models, not that they'll stop crawling.

I hope so too! My bet here so far is just throwing mana away on a what if haha.

"stopping using it" means you're not using it.

If a GPT-5-like model that has memorized a lot of the data is used to re-create something, is that the same as using the original data?

If they do that, and NotOpenAI crawls new data without making agreements with the sites it's crawling and then sells it to OpenAI, I'd hope the market would still resolve NO; that's just silly otherwise.

Agreed, which is why I'm making the point in the first place.

Another important thing is there are a lot of public domain texts that OpenAI uses. Probably they "freely crawled those". And they're not going to stop using them.

There are so many parts of the statement that might happen while others don't. The market is unclear. Request to @Rodeo to clarify what is required for a resolution. If OpenAI claims it will not collect free data in the future, is that gonna count? Do they have to say they will stop using previously collected data?