2) We are going to start running out of data to train large language models.
Resolved YES (Dec 12)

It has become a cliché to say that data is the new oil. This analogy is fitting in one underappreciated way: both resources are finite and at risk of being exhausted. The area of AI for which this concern is most pressing is language models.

As we discussed in the previous section, research efforts like DeepMind’s Chinchilla work have highlighted that the most effective way to build more powerful large language models (LLMs) is not to make them larger but to train them on more data.
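
As a rough illustration of that trade-off: the Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter. The sketch below applies that heuristic to a few example model sizes; the ratio and the sizes are illustrative assumptions, not figures taken from the prediction itself.

```python
# Rough Chinchilla-style rule of thumb: ~20 training tokens per parameter.
# The ratio and the example model sizes below are illustrative assumptions.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

for n_params in (70e9, 280e9, 1e12):  # 70B (Chinchilla-sized), 280B, 1T
    print(f"{n_params / 1e9:>5.0f}B params -> ~{compute_optimal_tokens(n_params) / 1e12:.1f}T tokens")
# A 70B model already calls for ~1.4T tokens; a 1T-parameter model would want ~20T.
```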

But how much more language data is there in the world? (More specifically, how much more language data is there that meets an acceptable quality threshold? Much of the text data on the internet is not useful to train an LLM on.)
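
To make "acceptable quality threshold" a little more concrete, here is a toy sketch of the kind of heuristic filters commonly used to clean web-scale text corpora (minimum length, alphabetic-character ratio, sentence-like lines). The specific thresholds are illustrative assumptions, not the criteria any particular research group used.

```python
# Toy document-quality filter in the spirit of web-corpus cleaning heuristics.
# The thresholds are illustrative assumptions, not any real pipeline's settings.
def passes_quality_filter(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                           # drop very short pages
        return False
    alpha_chars = sum(c.isalpha() for c in doc)
    if alpha_chars / max(len(doc), 1) < 0.7:      # drop symbol/markup-heavy pages
        return False
    lines = [line.strip() for line in doc.splitlines() if line.strip()]
    sentence_like = sum(line.endswith((".", "!", "?", '"')) for line in lines)
    return sentence_like / len(lines) >= 0.5      # most lines should read like sentences

print(passes_quality_filter("Buy now!!! $$$ click here"))  # False: too short and symbol-heavy
```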

This is a challenging question to answer with precision, but according to one research group, the world’s total stock of high-quality text data is between 4.6 trillion and 17.2 trillion tokens. This includes all the world’s books, all scientific papers, all news articles, all of Wikipedia, all publicly available code, and much of the rest of the internet, filtered for quality (e.g., webpages, blogs, social media). Another recent estimate puts the total figure at 3.2 trillion tokens.

DeepMind’s Chinchilla model was trained on 1.4 trillion tokens.

In other words, we may be well within one order of magnitude of exhausting the world’s entire supply of useful language training data. This could prove a meaningful impediment to continued progress in language AI. Privately, many leading AI researchers and entrepreneurs are worried about this.
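
The "within one order of magnitude" claim follows directly from the figures above; a quick back-of-the-envelope check:

```python
# Headroom between Chinchilla's training run and the estimated stock of
# high-quality text, using the token figures quoted in this description.
chinchilla_tokens = 1.4e12
stock_estimates = {
    "lower-bound estimate": 4.6e12,
    "upper-bound estimate": 17.2e12,
    "alternative estimate": 3.2e12,
}
for name, stock in stock_estimates.items():
    print(f"{name}: {stock / chinchilla_tokens:.1f}x Chinchilla's training data")
# ~3.3x, ~12.3x, and ~2.3x respectively -- all within one order of magnitude.
```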

Expect to see plenty of focus and activity in this area next year as LLM researchers seek to address the looming data shortage. One possible solution is synthetic data, though the details about how to operationalize this are far from clear. Another idea: systematically transcribing the spoken content of the world’s meetings (after all, spoken discussion represents vast troves of text data that today go uncaptured).
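
On the transcription idea, here is a minimal sketch of what capturing spoken meetings as text could look like, using an off-the-shelf open-source speech-to-text model such as Whisper. The tool choice, model size, and file name are assumptions, since the prediction does not specify any of them.

```python
# Minimal speech-to-text sketch using the open-source openai-whisper package
# (pip install openai-whisper). The model size and the audio file name are
# assumptions for illustration.
import whisper

model = whisper.load_model("base")         # small general-purpose checkpoint
result = model.transcribe("meeting.wav")   # returns a dict with "text" and segment info
print(result["text"][:500])                # first 500 characters of the transcript
```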

It will be fascinating and illuminating to see how OpenAI, the world’s leading LLM research organization, deals with this challenge in its soon-to-be-announced GPT-4 research.


If you enjoyed this market, please check out the other 9! https://manifold.markets/group/forbes-2023-ai-predictions

This market is from Rob Toews' annual AI predictions at Forbes magazine. This market will resolve based on Rob's own self-assessed score for these predictions when he publishes his retrospective on them at the end of the year.

Since Rob resolved and graded his 2022 predictions before the end of 2022, I am setting the close date ahead of the end of the year, to (try to) avoid a situation where he posts the resolutions before the market closes. In the event that his resolution post falls in 2024, my apologies in advance. If he hasn't posted resolutions at all by February 1, 2024, I will do my best to resolve them personally, and set N/A for any questions that I can't determine with outside source data.

-----

Edit 2023-07-05: Last year Rob used "Right-ish" to grade some of his predictions. In cases of a similar "Right-ish" (or "Wrong-ish") answer this year, I will resolve to 75% PROB or 25% PROB, respectively. This will also apply to similar language ("mostly right", "partial credit", "in the right direction"). If he says something like "hard to say" or "some right, some wrong", or anything else that feels like a cop-out or a 50% answer, I will just resolve that N/A.

Thanks to Henri Thunberg for requesting clarification in this comment!


predicted NO

@eccentricity - I 💯 agree that the claim isn't true. But I set up the market to resolve based on Rob's answer/BS, and not on reality. 😆 🤷

Wow, I can't believe they resolved this prediction to YES. Nonsensical

@firstuserhere They gave reasons why their prediction did not come true, argued that it will come true eventually, and still judged it correct!

Prediction 2: We are going to start running out of data to train large language models.

Outcome: Correct

In last year’s predictions, we noted that there is a finite amount of text data in the world available to train language models (taking into account all the world’s books, news articles, research papers, code, Wikipedia articles, websites, and so on)—and predicted that, as LLM training efforts rapidly scale, we would soon begin to exhaust this finite resource.

A year later, we are much closer to running out of the world’s supply of text training data. As of the end of 2022, the largest known LLM training corpus was the 1.4 trillion token dataset that DeepMind used to train its Chinchilla model. LLM builders have blown past that mark in 2023. Meta’s popular Llama 2 model (released in July) was trained on 2 trillion tokens. Alphabet’s PaLM 2 model (released in May) was trained on 3.6 trillion tokens. As mentioned above, OpenAI’s GPT-4 is rumored to have been trained on 13 trillion tokens.

In October, AI startup Together released an LLM dataset named RedPajama-2 that contains a stunning 30 trillion tokens—by far the largest such dataset yet created.

It is impracticable to determine exactly how many total usable tokens of text data exist in the world. Some previous estimates had actually pegged the number below 30 trillion. While the new Together dataset suggests that those estimates were too low, it is clear that we are fast approaching the limits of available text training data.

Those at the cutting edge of LLM research are well aware of this problem and are actively working to address it.

In a newly announced initiative called Data Partnerships, OpenAI has solicited partnerships with organizations around the world in order to gain access to new sources of training data.

In September, Meta announced a new model named Nougat that uses advanced OCR to turn the contents of old scientific books and journals into a more LLM-friendly data format. As AI researcher Jim Fan put it: “This is the way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready.”

These are clever initiatives to expand the pool of available text training data and stave off the looming data shortage. But they will only postpone, not resolve, the core dilemma created by insatiably data-hungry models.

predicted NO

@firstuserhere I guess they are backdating the "start" of the "running out" to sometime between 2023-01 and today. Before then, we hadn't "started" running out yet. 🙄

What might be the reasons for a slowdown in AI progress?

Not going to bet on how Rob Toews will resolve his own predictions, but I am highly confident that this will be an impediment to further progress sometime within the next few years.

@DavidBolin Why are you confident of that?

@firstuserhere Just created a market, might be better to move discussion there for topic-specificity.

@DavidBolin Yep, Rob basically judged all his predictions as correct even when blatantly wrong

predicted NO

@firstuserhere Who could have foreseen this outcome? 😂

bought Ṁ50 of NO

Ilya Sutskever says the data situation is still good (March 2023) https://youtu.be/Yf1o0TQzry8?t=657
