
It has become a cliché to say that data is the new oil. The analogy is fitting in one underappreciated way: both resources are finite and at risk of being exhausted. Nowhere in AI is this concern more pressing than in language models.
As we discussed in the previous section, research efforts like DeepMind’s Chinchilla work have highlighted that the most effective way to build more powerful large language models (LLMs) is not to make them larger but to train them on more data.
But how much more language data is there in the world? (More specifically, how much more language data is there that meets an acceptable quality threshold? Much of the text data on the internet is not useful to train an LLM on.)
This is a challenging question to answer with precision, but according to one research group, the world’s total stock of high-quality text data is between 4.6 trillion and 17.2 trillion tokens. This includes all the world’s books, all scientific papers, all news articles, all of Wikipedia, all publicly available code, and much of the rest of the internet, filtered for quality (e.g., webpages, blogs, social media). Another recent estimate puts the total figure at 3.2 trillion tokens.
DeepMind’s Chinchilla model was trained on 1.4 trillion tokens.
In other words, we may be well within one order of magnitude of exhausting the world’s entire supply of useful language training data. This could prove a meaningful impediment to continued progress in language AI. Privately, many leading AI researchers and entrepreneurs are worried about this.
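As a quick back-of-the-envelope check on that claim (the arithmetic is mine, using the token counts above):

```latex
\frac{17.2\,\text{T}}{1.4\,\text{T}} \approx 12.3, \qquad
\frac{4.6\,\text{T}}{1.4\,\text{T}} \approx 3.3, \qquad
\frac{3.2\,\text{T}}{1.4\,\text{T}} \approx 2.3
```

Even the most generous estimate of the high-quality text stock is only about 12x what Chinchilla alone consumed; the lower estimates leave barely 2-3x of headroom.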
Expect to see plenty of focus and activity in this area next year as LLM researchers seek to address the looming data shortage. One possible solution is synthetic data, though the details about how to operationalize this are far from clear. Another idea: systematically transcribing the spoken content of the world’s meetings (after all, spoken discussion represents vast troves of text data that today go uncaptured).
It will be fascinating and illuminating to see how OpenAI, the world’s leading LLM research organization, deals with this challenge in its soon-to-be-announced GPT-4 research.
If you enjoyed this market, please check out the other 9! https://manifold.markets/group/forbes-2023-ai-predictions
This market is from Rob Toews' annual AI predictions at Forbes magazine. This market will resolve based on Rob's own self-assessed score for these predictions when he publishes his retrospective on them at the end of the year.
Since Rob resolved and graded his 2022 predictions before the end of 2022, I am setting the close date ahead of the end of the year to (try to) avoid a situation where he posts his resolutions before the market closes. If his resolution post ends up falling in 2024, my apologies in advance. If he hasn't posted resolutions at all by February 1, 2024, I will do my best to resolve the predictions myself, resolving N/A any that I can't determine from outside sources.
-----
Edit 2023-07-05: Last year Rob used "Right-ish" to grade some of his predictions. In cases of a similar "Right-ish" (or "Wrong-ish") answer this year, I will resolve to 75% PROB or 25% PROB, respectively. This will also apply to similar language ("mostly right", "partial credit", "in the right direction"). If he says something like "hard to say" or "some right, some wrong", or anything else that feels like a cop-out or a 50% answer, I will just resolve that N/A.
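For concreteness, here is a minimal sketch of that grading rule in Python (the function name and the exact grade strings are my own illustration, not anything Rob or Manifold publishes; actual resolution will be a manual judgment call):

```python
def resolution_for(grade: str) -> float | None:
    """Map Rob's self-assessed grade to a resolution probability.

    Returns None where the market would resolve N/A.
    """
    g = grade.strip().lower()
    if g in {"right", "correct"}:
        return 1.00  # resolves YES
    if g in {"wrong", "incorrect"}:
        return 0.00  # resolves NO
    if g in {"right-ish", "mostly right", "partial credit",
             "in the right direction"}:
        return 0.75  # resolves to 75%
    if g in {"wrong-ish", "mostly wrong"}:  # "mostly wrong" is my symmetric guess
        return 0.25  # resolves to 25%
    # "hard to say", "some right, some wrong", or other 50%-style cop-outs
    return None      # resolves N/A
```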
Thanks to Henri Thunberg for requesting clarification in this comment!