Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?
Resolved NO (Apr 2)

There's a news report, paper, or blog post that I consider trustworthy reporting on a 1T parameter dense language model that was trained before January 1st, 2023.

Close date updated to 2022-12-31 11:59 pm.

Jun 16, 10:15pm: Since there's been some confusion in the comments, I want to clarify that "dense" here is in contrast to sparse models like mixture-of-experts. I've clarified more in a comment below, from which I've copied the relevant snippet:

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.
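
For concreteness, here's a minimal numpy sketch of that distinction (illustration only, with made-up shapes and simple top-1 routing): the dense FFN uses all of its weights for every token, while the MoE layer routes each token to just one expert's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 8, 32, 4, 5

def dense_ffn(x, W1, W2):
    # Dense FFN: every parameter in W1 and W2 is used for every token.
    return np.maximum(x @ W1, 0) @ W2

def moe_ffn(x, experts, W_gate):
    # Sparsely activated MoE FFN with top-1 routing: each token only touches
    # the parameters of the single expert the gating network picks for it.
    out = np.zeros_like(x)
    expert_ids = np.argmax(x @ W_gate, axis=-1)
    for i, e in enumerate(expert_ids):
        W1, W2 = experts[e]
        out[i] = np.maximum(x[i] @ W1, 0) @ W2
    return out

x = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
W_gate = rng.normal(size=(d_model, n_experts))

print(dense_ffn(x, W1, W2).shape)        # (5, 8): all FFN params touched per token
print(moe_ffn(x, experts, W_gate).shape) # (5, 8): ~1/4 of expert params touched per token
```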

EDIT: In the comments, Lauro Langosco di Langosco pointed out that it's possible we won't know as of the end of the year whether someone trained a model with 1T params in 2022 and just hadn't announced it yet. He also noted that the current phrasing of the question is very focused on trained vs. announced. Given that, my compromise here is to wait 3 months after the start of 2023 to resolve this rather than doing it right away. If no one announces something they trained with 1T parameters by then, I'll resolve "No".

EDIT 2: Lauro pushed back on three months as being too short, so I'll wait another year to resolve.

EDIT 3: After further deliberation, I've decided to stick with 3 months after 01/01/2023 for resolution. Another clarification I want to make: I won't definitely resolve YES unless the 1T model seems at least competitive with existing SoTA models. As mentioned in the comments, if someone trains a 1T parameter dense model that's only about as good as, say, davinci-002, I'll have to decide what to do. I don't think that's likely enough to pre-plan for, though.

Dec 12, 10:40am: Will someone train a 1T parameter dense language model model this year? → Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?

predicted NO

Before I pull the final trigger, I'm looking for feedback here. Given that there's no information that's going to confirm GPT-4's parameter count, I'm inclined to resolve this NO. Any major objections?

predicted YES

@StephenMalina I think that given the description it would be better to wait until the 1st of April, but otherwise, I don't have any objections.

predicted NO

@MikhailDoroshenko yep just starting the conversation now so I can be good to go then.

predicted NO

@NoaNabeshima if you're asking me for a resolution, it hasn't been three months yet (see description). Will resolve on April 1st.

predicted NO

@StephenMalina Right, sorry!

Does this count? https://arxiv.org/abs/2110.03888

"We demonstrate a practice of pretraining unprecedented 10-trillion-parameter model, an order of magnitude larger than the state-of-the-art, on solely 512 GPUs within 10 days"

predicted NO

@DanElton I don't think so.
1) These seem at least partially like mixture-of-experts models:
> M6 is built with stacking transformer layers, which includes self attention and feed-forward neural nets (FFN). For the transformation from dense models to sparse expert models, we should only replace FFN layers with the Mixture-of-Expert (MoE) layers. MoE consists of multiple experts, which are usually FFNs distributed on different devices. A gating network decides the dispatching and combining behaviors of each token and thus tokens can be processed in diverse devices. Such mechanism is a combination of data parallelism and expert parallelism, and thus it is highly efficient though with large model capacity. For the training, to realize the learning of both understanding and generation, the model is trained with text denoising and language modeling on plain text data and with image-based text denoising and image captioning on multimodal data. The model is compatible with different types of downstream tasks and can process information of multiple modalities.

2) I specified in the comments that the model had to be competitive, and this model seemingly is not at all.

bought Ṁ200 of NO
predicted NO

@rocketsan I'm curious about your models-- why do you think someone will train a 1T parameter transformer?

predicted NO

My earlier joke aside, what is the cutoff for quality? Does it need to be near SOTA? If someone trains a 1T parameter model to, say, GPT-2 quality, would that resolve YES?

predicted NO

@vluzko this isn't in the original description but the spirit of the question was a model that's at least trying to be somewhat better than existing ones.

predicted NO

@vluzko the biggest edge case is if someone trains something that's at original davinci or davinci-002 level quality with 1T dense parameters. I honestly don't know what I'd do then; I'm kind of just hoping it doesn't happen. If it does, I'll solicit feedback from market participants.

bought Ṁ300 of NO

Suppose that GPT-3 -> GPT-4 is a 2 OOM compute increase. If costs scale linearly, that would be $250–500M, which would be an ambitious but not implausible percentage of the OpenAI and DeepMind budget [DM annual budget ~$1.7B, OA budget ~$0.2–0.4B? Less certain about the OA budget than the DM budget]. Probably budget isn't the right way of thinking about it, given that DM and OA have their own hardware, but it still gives a sense of the scale. That kind of spending doesn't look like it would quite reach 1T if the model is Chinchilla-trained (eyeballing these graphs: https://arxiv.org/pdf/2203.15556.pdf). It looks like it might be another OOM to get to 1T, which would be out of OA's budget for GPT-4.

I'm not sure how to think about Google Brain, but I'm imagining that they have within a factor of 2 the budget of DeepMind.
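
A rough sanity check on these numbers, assuming training compute C ≈ 6·N·D FLOPs and the Chinchilla rule of thumb D ≈ 20·N tokens (so C ≈ 120·N²), with GPT-3 taken as 175B params on 300B tokens. This is just my own back-of-the-envelope math, not figures from the papers.

```python
import math

# Assumes C ~= 6*N*D training FLOPs and Chinchilla-optimal data D ~= 20*N tokens,
# so C ~= 120*N**2. GPT-3 is taken as ~175B params trained on ~300B tokens.
gpt3_flops = 6 * 175e9 * 300e9                  # ~3.15e23 FLOPs

def chinchilla_optimal_params(flops):
    # N such that 120 * N**2 == flops
    return math.sqrt(flops / 120)

budget = 100 * gpt3_flops                       # "2 OOM" more compute than GPT-3
print(f"Params at 100x GPT-3 compute: {chinchilla_optimal_params(budget):.1e}")   # ~5.1e11
print(f"FLOPs for a 1T-param model:   {120 * 1e12**2:.1e} (~{120 * 1e12**2 / gpt3_flops:.0f}x GPT-3)")
```

On these assumptions, 100x GPT-3's compute is Chinchilla-optimal for roughly 500B params, and a 1T model needs about 4x more on top of that, i.e. a bit under one extra OOM.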

predicted NO

@NoaNabeshima but GPT3 is already not dense, GPT4 probably also isn't.

predicted NO

@L (though to be clear that doesn't contradict your point, it makes your analysis unnecessarily weak evidence for your point.)

predicted YES

@NoaNabeshima I don't think that's right: GPT3 was 175b params on 500b tokens. Scaling to 1T params and 20T tokens (chinchilla-optimal) means roughly 6x params and 40x tokens, which gives ~240x the FLOPs. That seems not crazy, especially given that compute costs probably scale sublinearly.
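
Checking that arithmetic with C ≈ 6·N·D (a quick sketch; the 240x comes from rounding 1T/175B up to 6x, and the 300B-token correction that comes up below pushes it to roughly 400x):

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: training compute ~= 6 * params * tokens.
    return 6 * n_params * n_tokens

target = train_flops(1e12, 20e12)           # 1T params, 20T tokens (Chinchilla-optimal)
print(target / train_flops(175e9, 500e9))   # ~228.6x with the 500b-token figure above
print(target / train_flops(175e9, 300e9))   # ~381.0x with GPT-3's actual ~300B tokens
```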

predicted NO

@L I thought GPT-3 would be considered dense for the purposes of this market

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.

bought Ṁ0 of NO

@Lauro Oh, nice! Why do you think compute costs scale sublinearly?

predicted NO

How sublinearly are you imagining?

predicted NO

@Lauro GPT-3 was trained on 300B tokens
https://arxiv.org/pdf/2005.14165.pdf

predicted NO

@L GPT3 isn't dense, but in a different sense - banded attention rather than MoE.

predicted NO

@vluzko I agree that it isn't dense in that sense, but my guess is that GPT-3 is dense in the spirit of the market. @StephenMalina ?

sold Ṁ239 of NO

@Lauro I think my math might be wrong, but I think the Metaculus community predicts a ~3% chance of GPT-4 having that many FLOPs

predicted NO

or more

predicted NO

@NoaNabeshima yes that's correct, GPT3 is dense by this definition!

@L GPT3 isn't dense?

predicted NO

@LostFutures It is dense in the sense of this market - it isn't a mixture of experts model where only a fraction of the parameters are actually used in any forward/backward pass.

But it "isn't dense" - or rather it is sparse - in that some of the attention layers are locally banded.

predicted YES

@NoaNabeshima oh good point! So if GPT3 was trained on 300b tokens, that would make it 6x params and 66x tokens, which makes ~400x the FLOPs.

Re sublinear: I expect compute costs to have gone down somewhat since GPT3 was trained, and I also expect standard economies of scale to help a little. I don't have a good sense for how much costs go down - I guess I'd expect compute costs to be something like 30%-80% of the GPT3 per FLOP cost? Don't take this estimate seriously though.

FWIW I think the Metaculus market is miscalibrated. I'd put the probability at ~30%, maybe.

predicted NO

@Lauro hmmm, I don't have models of compute costs over time, so I'm just throwing thoughts around, but do you think the chip shortage would affect things?
I'm surprised you think the Metaculus market is miscalibrated in a particular direction. I do notice you've been profiting on Manifold. I'm curious about when you expect to beat Metaculus.
