Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?
13% chance

There's a news report, paper, or blog post that I consider trustworthy reporting on a 1T parameter dense language model that was trained before January 1st, 2023.

Close date updated to 2022-12-31 11:59 pm

Jun 16, 10:15pm: Since there's been some confusion in the comments, I want to clarify that "dense" here is in contrast to sparse models like mixture-of-experts. I've clarified more in a comment below, from which I've copied the relevant snippet:

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.

EDIT: In the comments, Lauro Langosco di Langosco pointed out that it's possible we won't know as of the end of the year whether someone trained a model with 1T params in 2022 and just hadn't announced it yet. He also noted that the current phrasing of the question is very focused on trained vs. announced. Given that, my compromise here is to wait 3 months after the start of 2023 to resolve this rather than doing it right away. If no one announces something they trained with 1T parameters by then, I'll resolve "No".

EDIT 2: Lauro pushed back on three months as being too short, so I'll wait another year to resolve.

EDIT 3: After further deliberation, I've decided to stick with 3 months after 01/01/2023 for resolution. Another clarification I wanted to make: this won't definitely resolve positively unless the 1T model seems at least competitive with existing SoTA models. As mentioned in the comments, if someone trains a 1T parameter dense model that's only about as good as, say, davinci-002, I'll have to decide what I'm going to do. I don't think that's likely enough to pre-plan for, though.

Dec 12, 10:40am: Will someone train a 1T parameter dense language model model this year? → Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?

Dan Elton

Does this count? https://arxiv.org/abs/2110.03888

"We demonstrate a practice of pretraining unprecedented 10-trillion-parameter model, an order of magnitude larger than the state-of-the-art, on solely 512 GPUs within 10 days"

Stephen Malina is predicting NO at 13%

@DanElton I don't think so.
1) These seem at least partially like mixture-of-expert models:
> M6 is built with stacking transformer layers, which includes self attention and feed-forward neural nets (FFN). For the transformation from dense models to sparse expert models, we should only replace FFN layers with the Mixture-of-Expert (MoE) layers. MoE consists of multiple experts, which are usually FFNs distributed on different devices. A gating network decides the dispatching and combining behaviors of each token and thus tokens can be processed in diverse devices. Such mechanism is a combination of data parallelism and expert parallelism, and thus it is highly efficient though with large model capacity. For the training, to realize the learning of both understanding and generation, the model is trained with text denoising and language modeling on plain text data and with image-based text denoising and image captioning on multimodal data. The model is compatible with different types of downstream tasks and can process information of multiple modalities.

2) I specified in the comments that the model had to be competitive, and this model seemingly is not at all.

Noa Nabeshima bought Ṁ200 of NO
Noa Nabeshima is predicting NO at 25%

@rocketsan I'm curious about your models-- why do you think someone will train a 1T parameter transformer?

Vincent Luczkow is predicting NO at 24%

My earlier joke aside, what is the cutoff for quality? Does it need to be near SOTA? If someone trains a 1T parameter model to, say, GPT-2 quality, would that resolve YES?

Stephen Malina is predicting NO at 21%

@vluzko this isn't in the original description but the spirit of the question was a model that's at least trying to be somewhat better than existing ones.

Stephen Malina is predicting NO at 21%

@vluzko the biggest edge case is if someone trains something that's original or davinci-002 level quality with 1T dense parameters. I honestly don't know what I'd do then, kind of just hoping it doesn't happen. If it does, I'll solicit feedback from market participants.

Noa Nabeshima bought Ṁ300 of NO

Suppose that GPT-3 -> GPT-4 is a 2 OOM compute increase. If costs scale linearly, that would be $250-500M, which would be an ambitious but not implausible fraction of the OpenAI and DeepMind budgets [DM annual budget ~$1.7B, OA budget ~$0.2-0.4B? Less certain about the OA budget than the DM budget]. Budget probably isn't the right way of thinking about it, given that DM and OA have their own hardware, but it still gives a sense of the scale. That kind of spending doesn't look like it would quite reach 1T parameters if the model is Chinchilla-trained (eyeballing the graphs in https://arxiv.org/pdf/2203.15556.pdf). It looks like it might take another OOM to get to 1T, which would be out of OA's budget for GPT-4.

I'm not sure how to think about Google Brain, but I'm imagining that they have within a factor of 2 the budget of DeepMind.
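A rough, hedged version of this back-of-the-envelope estimate (the FLOPs ≈ 6 · params · tokens approximation, the ~20 tokens/param Chinchilla heuristic, and the FLOPs-per-dollar figure are illustrative assumptions, not numbers from the thread):

```python
# Back-of-the-envelope: compute and cost for a Chinchilla-optimal 1T-param dense model.
params = 1e12                      # 1T parameters
tokens = 20 * params               # ~20 tokens per parameter (Chinchilla heuristic)
train_flops = 6 * params * tokens  # ~1.2e26 FLOPs

flops_per_dollar = 1e17            # assumed; depends heavily on hardware and utilization
cost_usd = train_flops / flops_per_dollar

gpt3_flops = 6 * 175e9 * 300e9     # GPT-3: 175B params, 300B tokens -> ~3e23 FLOPs
print(f"1T run: ~{train_flops:.1e} FLOPs, ~${cost_usd / 1e6:.0f}M at the assumed price")
print(f"That is ~{train_flops / gpt3_flops:.0f}x GPT-3's training compute")
```

Under these assumptions the run costs on the order of a billion dollars, which is consistent with the budget concern above.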

L is predicting NO at 20%

@NoaNabeshima but GPT3 is already not dense, GPT4 probably also isn't.

L is predicting NO at 20%

@L (though to be clear that doesn't contradict your point, it makes your analysis unnecessarily weak evidence for your point.)

Lauro Langosco is predicting YES at 20%

@NoaNabeshima I don't think that's right: GPT3 was 175b params on 500b tokens. Scaling to 1T params and 20T tokens (chinchilla-optimal) means roughly 6x params and 40x tokens, which gives ~240x the training FLOPs. That seems not crazy, especially given that compute costs probably scale sublinearly.

Noa Nabeshima is predicting NO at 20%

@L I thought GPT-3 would be considered dense for the purposes of this market

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.

Noa Nabeshima bought Ṁ0 of NO

@Lauro Oh, nice! Why do you think compute costs scale sublinearly?

Noa Nabeshima is predicting NO at 20%

How sublinearly are you imagining?

Noa Nabeshima is predicting NO at 20%

@Lauro GPT-3 was trained on 300B tokens
https://arxiv.org/pdf/2005.14165.pdf

Vincent Luczkow is predicting NO at 20%

@L GPT3 "isn't dense" in a different sense than the one that matters here: it has banded (sparse) attention, not MoE.

Noa Nabeshima is predicting NO at 18%

@vluzko I agree that it isn't dense in that sense, but my guess is that GPT-3 is dense in the spirit of the market. @StephenMalina ?

Noa Nabeshima sold Ṁ239 of NO

@Lauro I think my work might be wrong, but I think the Metaculus community predicts ~3% chance of GPT-4 having that many FLOPs

Noa Nabeshima is predicting NO at 18%

or more

Stephen Malina is predicting NO at 15%

@NoaNabeshima yes that's correct, GPT3 is dense by this definition!

LostFutures

@L GPT3 isn't dense?

Vincent Luczkow is predicting NO at 18%

@LostFutures It is dense in the sense of this market - it isn't a mixture of experts model where only a fraction of the parameters are actually used in any forward/backward pass.

But it "isn't dense" - or rather it is sparse - in that some of the attention layers are locally banded.

Lauro Langosco is predicting YES at 20%

@NoaNabeshima oh good point! So if GPT3 is 300b tokens, that would make it 6x params and 66x tokens, which makes ~400x the training FLOPs.

Re sublinear: I expect compute costs to have gone down somewhat since GPT3 was trained, and I also expect standard economies of scale to help a little. I don't have a good sense for how much costs go down - I guess I'd expect compute costs to be something like 30%-80% of the GPT3 per FLOP cost? Don't take this estimate seriously though.

FWIW I think the Metaculus market is miscalibrated. I'd put the probability at ~30%, maybe.
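Spelling out the 6x/66x arithmetic above, under the common approximation that training compute scales with params × tokens (the constant factor cancels in the ratio):

```python
# Ratio of training compute: GPT-3 vs. a Chinchilla-optimal 1T-param dense model.
gpt3_params, gpt3_tokens = 175e9, 300e9      # GPT-3 paper: 175B params, 300B tokens
target_params, target_tokens = 1e12, 20e12   # ~20 tokens/param (Chinchilla heuristic)

ratio = (target_params / gpt3_params) * (target_tokens / gpt3_tokens)
print(f"~{target_params / gpt3_params:.1f}x params, ~{target_tokens / gpt3_tokens:.0f}x tokens "
      f"-> ~{ratio:.0f}x training FLOPs")
# With the 500B-token figure quoted earlier in the thread, the same ratio is ~230x.
```

Output is roughly "5.7x params, 67x tokens -> 381x training FLOPs", i.e. close to the ~400x figure above.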

Noa Nabeshima is predicting NO at 18%

@Lauro hmmm, I don't have models of compute costs over time, so I'm just throwing thoughts around, but do you think the chip shortage would affect things?
I'm surprised you think the Metaculus market is miscalibrated in a particular direction. I do notice you've been profiting on Manifold. I'm curious about when you expect to beat Metaculus.

Vincent Luczkow is predicting NO at 16%

I can't resolve this myself by training a 1T parameter model really really slowly and then posting about it on my blog, right?

Victor Levoso is predicting YES at 16%

@vluzko You could train it on a single token maybe :b

You would have trouble fitting it in memory, but there are ways to use a hard disk for that.

Vincent Luczkow is predicting NO at 14%

@VictorLevoso I know I can train a 1T parameter model at... *checks disk rates* ...maybe one pass per hour; the question is whether doing it would resolve this market.

Stephen Malina is predicting NO at 34%

@vluzko no 😂

Lauro Langosco is predicting YES at 20%

@StephenMalina hm... how good does a model have to be in order to qualify?

Stephen Malina is predicting NO at 21%
Vincent Luczkow bought Ṁ50 of NO

I'm surprised by how much weight people are still putting on YES. PaLM was 540B parameters, but that was with previous scaling laws. With current best estimates for scaling laws, 1T parameters would be a massive jump in compute over the current largest models. We're only going to see this in 2022 if someone burns millions of dollars training a suboptimal model as a publicity stunt, discovers a breakthrough architecture that scales differently, or decides it's time to go all in on a particular model (and they started training a few months ago).
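For a rough sense of how big that jump is, here's a hedged sketch comparing PaLM's training compute (540B params on roughly 780B tokens, per the PaLM paper) with a Chinchilla-optimal 1T-parameter run; the 6 · N · D approximation is an assumption:

```python
# How big a jump is a Chinchilla-optimal 1T-param run relative to PaLM?
# Approximation: training FLOPs ~ 6 * params * tokens.
palm_flops = 6 * 540e9 * 780e9   # PaLM: 540B params, ~780B tokens
target_flops = 6 * 1e12 * 20e12  # 1T params at ~20 tokens/param

print(f"PaLM: ~{palm_flops:.1e} FLOPs; 1T Chinchilla-optimal: ~{target_flops:.1e} FLOPs")
print(f"-> roughly {target_flops / palm_flops:.0f}x PaLM's training compute")
```

Under these assumptions a compute-optimal 1T-parameter dense model needs several dozen times PaLM's training compute, which is the "massive jump" referred to above.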

Stephen Malina

Really surprised this market hasn't moved at all as the end of 2022 approaches. Maybe people are either not seeing it, or they are holding out for GPT4?

Lauro Langosco is predicting YES at 10%

@StephenMalina presumably many models trained in 2022 will only be known to the public much later. It can take >1y from training to release / publication.

Stephen Malina is predicting NO at 9%

@LauroLangoscodiLangosco I admittedly hadn't considered that. I was implicitly thinking of models we'd know about this year, but I can see how it's ambiguous. That said, do you really think that's why the price was high for so long? Given it eventually dropped, I tend to think it's just that people thought that one might get announced.

Lauro Langosco is predicting YES at 9%

@StephenMalina I imagine it may have been different groups of people interpreting the question different ways, or possibly just not thinking of publishing delays. I do think the probability is low even given that we wouldn't have heard of it if someone had trained a 1T model.

IMO the current description seems pretty unambiguous in that it asks for a model trained before Jan 1st 2023, not published. If you want to resolve differently you probably should update the description (though I'd encourage you to not do that, ideally).

Stephen Malina is predicting NO at 9%

@LauroLangoscodiLangosco yeah, I think the fairer option is for me to update the description to say that I'll wait 6 months after close to resolve, and if we haven't heard of one by then, I'll resolve to "No". Seem fair to you?

Stephen Malina is predicting NO at 9%

@StephenMalina actually 6 months is a long time, thinking 3 now.

Lauro Langosco is predicting YES at 9%

@StephenMalina Personally I don't mind much either way. IMO: if your goal is to resolve markets accurately as per the description, then I think you should wait for at least a year. If you're fine with your markets implicitly being about "what I mean by the description, not what it says", then it seems fine to resolve sooner. If you plan on doing so regularly, it's probably good to include a notice to that effect in your market descriptions.

(fyi regarding your note in the description: my name is Lauro and my pronouns are he/him 🙂 )

Stephen Malina is predicting NO at 21%

> (fyi regarding your note in the description: my name is Lauro and my pronouns are he/him 🙂 )

@Lauro oops, so sorry about that (fixed)!

Stephen Malina is predicting NO at 21%

> then I think you should wait for at least a year.

I want to be as explicit as possible, so I'll wait a year, although it pains me to have to track the market that long. Thank you for pushing back on my unprincipledness.

Vincent Luczkow is predicting NO at 21%

@StephenMalina I don't think you should wait a year; trained models are rarely if ever kept quiet, and it seems kind of pointlessly annoying to drag this one out.

Stephen Malina is predicting NO at 21%

@vluzko thanks, I wish there was a way to create a poll of people who have currently invested in this market on different resolution times...

Stephen Malina is predicting NO at 21%

@vluzko @Lauro thinking about it more, I will stick with 3 months. I have learned a lot from this and other comment threads on this market. Going forward, I will clarify that my markets should be assumed to resolve based on description + title, not just title. While I'd like to promise I'll come up with less ambiguous markets in the future, and will try, I also think part of the value of Manifold is being able to bet on things that aren't perfectly specifiable, so I want to be open that I will continue to create markets where there's some discretion involved. I'll try to be as explicit about this as possible when I do it, though. Thank you both for your comments on this market; I really appreciate them.

Stephen Malina

Given all the confusion in this market and the fact that I'm betting heavily in one direction, I'm a bit concerned that it's going to end up looking like I rugged everyone. As someone who really tries to avoid the appearance of cheating on Manifold, I don't want that, even if I do think I was fairly clear in the original question. Can people give thoughts in the next two days on me resolving this to N/A and creating a new market that differentiates more clearly? I'll make a decision then (Saturday afternoon).

Lawrence Chan is predicting NO at 51%

@StephenMalina I think the question w/ clarification is fine as is.

Stephen Malina

@LawrenceChan thanks, unless an overwhelming majority speaks up I'll keep it as is then.

Gigacasting

Should be clarified to say “non-transformer” (No one even uses the term “Dense” anymore.)

Lawrence Chan is predicting NO at 43%

@Gigacasting No, GPT-3, Gopher, and PaLM are all "dense" transformer models, in that all parts of the model are engaged during a forward pass. In contrast, I think a "sparse" model in this context refers to something like a Mixture-of-Experts or Switch Transformer model, where most of the parameters are inactive for a given forward pass. @StephenMalina please correct me if I'm misunderstanding your intent.

Gigacasting sold Ṁ47 of NO

Terrible poll. Good luck.

Rai

@Gigacasting "Dense transformer" vs. "sparse (MoE) transformer" is terminology that I, as someone working with transformers, am familiar with, though admittedly it could be confused with "fully connected feed-forward networks that are not transformers". The difference between a dense and a sparse transformer is whether the MLP part of the transformer is just a linear layer or a MoE.

Gigacasting

If your “term of art” has existed for less than a couple months, isn’t on the front page of Google, and previously meant something else, you might be doing it wrong.

Stephen Malina

@Gigacasting to be clear, this is my fault, not @Rai's. I apologize and have clarified in the description. Just to be clear, it's not "transformer" vs. not either, though; it's specifically focused on the distinction between mixture-of-experts models (ex: https://arxiv.org/abs/1701.06538), which are sparsely activated, vs. dense models, which use all their nodes for each forward pass.

Lawrence Chan bought Ṁ50 of NO

Replying to Rai's comment: 1) Switch Transformers aren't dense models, and 2) the result was from 2021: https://arxiv.org/abs/2101.03961

Stephen Malina bought Ṁ30 of NO

@LawrenceChan thank you! I can tell I'm gonna get slammed when this resolves but I tried to make it clear that it was about dense models. Appreciate others helping out.

Rai is predicting YES at 38%

@StephenMalina yeah, if you specifically want a dense model, that's way less likely, just because as far as I know dense models are becoming less popular. So it's a bit like asking "will there be a car sold that can do 0-80 in 2 seconds, and is powered by diesel, not electricity?"...

Lawrence Chan is predicting NO at 43%

I mean, PaLM did get released this year, and it has 540b parameters :)

Stephen Malina

@LawrenceChan agree, and @Rai I disagree that that's why it's way less likely. Dense models still seem quite popular amongst GB/OAI/DM given recent news. If anything, the main downwards update was due to Chinchilla, which showed different scaling curves along which we are much further from a compute-optimal model having that many parameters.

Rai is predicting YES at 43%

https://twitter.com/LiamFedus/status/1536791574612303872?s=20&t=CC1hLbIOTqymt4Pm85TlkA "Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!"

Lawrence Chan sold Ṁ21 of NO

Google trained a 540b model (PaLM): https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf

ampdot bought Ṁ20 of NO

Smaller LLMs trained for longer outperform: https://twitter.com/karpathy/status/1509227367302148098

Stephen Malina bought Ṁ1 of NO

@j I don't count it because it's a Mixture of Experts model and the question intentionally specifies "dense language model" as I anticipated this and didn't want to count MoEs.

ampdot bought Ṁ1 of NO

Does the 1.75T-parameter Chinese LLM count? https://www.techradar.com/news/china-outstrips-gpt-3-with-even-more-ambitious-ai-language-model

Stephen Malina bought Ṁ100 of NO

Recent news from DeepMind suggests this would now be a >1 OOM parameter increase from the best model (https://twitter.com/MatthewJBar/status/1509262934639325188?s=20).