Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?
13% chance

There's a news report, paper, or blog post that I consider trustworthy reporting on a 1T parameter dense language model that was trained before January 1st, 2023.

Close date updated to 2022-12-31 11:59 pm

Jun 16, 10:15pm: Since there's been some confusion in the comments, I want to clarify that "dense" here is in contrast to sparse models like mixture-of-experts. I've clarified more in a comment below, from which I've copied the relevant snippet:

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.

EDIT: In the comments, Lauro Langosco di Langosco pointed out that it's possible we won't know as of the end of the year whether someone trained a model with 1T params in 2022 and just hadn't announced it yet. He also noted that the current phrasing of the question is very focused on trained vs. announced. Given that, my compromise here is to wait 3 months after the start of 2023 to resolve this rather than doing it right away. If no one announces something they trained with 1T parameters by then, I'll resolve "No".

EDIT 2: Lauro pushed back on three months as being too short, so I'll wait another year to resolve.

EDIT 3: After further deliberation, I've decided to stick with 3 months after 01/01/2023 for resolution. Another clarification I wanted to make: this won't definitely resolve positively unless the 1T model seems at least competitive with existing SoTA models. As mentioned in the comments, if someone trains a 1T parameter dense model that's only about as good as, say, davinci-002, I'll have to decide what I'm going to do. I don't think that's likely enough to pre-plan for, though.

Dec 12, 10:40am: Will someone train a 1T parameter dense language model model this year? → Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?

Dan Elton

Does this count? https://arxiv.org/abs/2110.03888

"We demonstrate a practice of pretraining unprecedented 10-trillion-parameter model, an order of magnitude larger than the state-of-the-art, on solely 512 GPUs within 10 days"

Stephen Malina is predicting NO at 13%

@DanElton I don't think so.
1) These seem at least partially like mixture-of-expert models:
> M6 is built with stacking transformer layers, which includes self attention and feed-forward neural nets (FFN). For the transformation from dense models to sparse expert models, we should only replace FFN layers with the Mixture-of-Expert (MoE) layers. MoE consists of multiple experts, which are usually FFNs distributed on different devices. A gating network decides the dispatching and combining behaviors of each token and thus tokens can be processed in diverse devices. Such mechanism is a combination of data parallelism and expert parallelism, and thus it is highly efficient though with large model capacity. For the training, to realize the learning of both understanding and generation, the model is trained with text denoising and language modeling on plain text data and with image-based text denoising and image captioning on multimodal data. The model is compatible with different types of downstream tasks and can process information of multiple modalities.

2) I specified in the comments that the model had to be competitive, and this model seemingly is not at all.

Noa Nabeshima bought Ṁ200 of NO
Noa Nabeshima is predicting NO at 25%

@rocketsan I'm curious about your models-- why do you think someone will train a 1T parameter transformer?

Vincent Luczkow is predicting NO at 24%

My earlier joke aside, what is the cutoff for quality? Does it need to be near SOTA? If someone trains a 1T parameter model to, say, GPT-2 quality, would that resolve YES?

Stephen Malina is predicting NO at 21%

@vluzko this isn't in the original description but the spirit of the question was a model that's at least trying to be somewhat better than existing ones.

Stephen Malina is predicting NO at 21%

@vluzko the biggest edge case is if someone trains something that's original or davinci-002 level quality with 1T dense parameters. I honestly don't know what I'd do then, kind of just hoping it doesn't happen. If it does, I'll solicit feedback from market participants.

Noa Nabeshima bought Ṁ300 of NO

Suppose that GPT-3 -> GPT-4 is a 2 OOM compute increase. If costs scale linearly, that would be $250-500M, which would be an ambitious but not implausible fraction of the OpenAI and DeepMind budgets [DM annual budget ~$1.7B, OA budget ~$0.2-0.4B? Less certain about the OA budget than the DM budget]. Budget probably isn't the right way of thinking about it, given that DM and OA have their own hardware, but it still gives a sense of the scale. That kind of spending doesn't look like it would quite reach 1T parameters if the model is Chinchilla-trained (eyeballing the graphs in https://arxiv.org/pdf/2203.15556.pdf). It looks like it might take another OOM to get to 1T, which would be out of OA's budget for GPT-4.

I'm not sure how to think about Google Brain, but I'm imagining that they have within a factor of 2 the budget of DeepMind.
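A rough, hedged version of this back-of-the-envelope estimate (the FLOPs ≈ 6 · params · tokens approximation, the ~20 tokens/param Chinchilla heuristic, and the FLOPs-per-dollar figure are illustrative assumptions, not numbers from the thread):

```python
# Back-of-the-envelope: compute and cost for a Chinchilla-optimal 1T-param dense model.
params = 1e12                      # 1T parameters
tokens = 20 * params               # ~20 tokens per parameter (Chinchilla heuristic)
train_flops = 6 * params * tokens  # ~1.2e26 FLOPs

flops_per_dollar = 1e17            # assumed; depends heavily on hardware and utilization
cost_usd = train_flops / flops_per_dollar

gpt3_flops = 6 * 175e9 * 300e9     # GPT-3: 175B params, 300B tokens -> ~3e23 FLOPs
print(f"1T run: ~{train_flops:.1e} FLOPs, ~${cost_usd / 1e6:.0f}M at the assumed price")
print(f"That is ~{train_flops / gpt3_flops:.0f}x GPT-3's training compute")
```

Under these assumptions the run costs on the order of a billion dollars, which is consistent with the budget concern above.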

L is predicting NO at 20%

@NoaNabeshima but GPT3 is already not dense, GPT4 probably also isn't.

L is predicting NO at 20%

@L (though to be clear that doesn't contradict your point, it makes your analysis unnecessarily weak evidence for your point.)

Lauro Langosco is predicting YES at 20%

@NoaNabeshima I don't think that's right: GPT3 was 175b params on 500b tokens. Scaling to 1T params and 20T tokens (chinchilla-optimal) means roughly 6x params and 40x tokens, which gives ~240x the training FLOPs. That seems not crazy, especially given that compute costs probably scale sublinearly.

Noa Nabeshima is predicting NO at 20%

@L I thought GPT-3 would be considered dense for the purposes of this market

> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.

Noa Nabeshima bought Ṁ0 of NO

@Lauro Oh, nice! Why do you think compute costs scale sublinearly?

Noa Nabeshima is predicting NO at 20%

How sublinearly are you imagining?

Noa Nabeshima is predicting NO at 20%

@Lauro GPT-3 was trained on 300B tokens
https://arxiv.org/pdf/2005.14165.pdf

Vincent Luczkow is predicting NO at 20%

@L GPT3 "isn't dense" in a different sense than the one that matters here: it has banded (sparse) attention, not MoE.

Noa Nabeshima is predicting NO at 18%

@vluzko I agree that it isn't dense in that sense, but my guess is that GPT-3 is dense in the spirit of the market. @StephenMalina ?

Noa Nabeshima sold Ṁ239 of NO

@Lauro I think my work might be wrong, but I think the Metaculus community predicts ~3% chance of GPT-4 having that many FLOPs

Noa Nabeshima is predicting NO at 18%

or more

Stephen Malina is predicting NO at 15%

@NoaNabeshima yes that's correct, GPT3 is dense by this definition!

LostFutures

@L GPT3 isn't dense?

Vincent Luczkow is predicting NO at 18%

@LostFutures It is dense in the sense of this market - it isn't a mixture of experts model where only a fraction of the parameters are actually used in any forward/backward pass.

But it "isn't dense" - or rather it is sparse - in that some of the attention layers are locally banded.

Lauro Langosco is predicting YES at 20%

@NoaNabeshima oh good point! So if GPT3 is 300b tokens, that would make it 6x params and 66x tokens, which makes ~400x the training FLOPs.

Re sublinear: I expect compute costs to have gone down somewhat since GPT3 was trained, and I also expect standard economies of scale to help a little. I don't have a good sense for how much costs go down - I guess I'd expect compute costs to be something like 30%-80% of the GPT3 per FLOP cost? Don't take this estimate seriously though.

FWIW I think the Metaculus market is miscalibrated. I'd put the probability at ~30%, maybe.
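Spelling out the 6x/66x arithmetic above, under the common approximation that training compute scales with params × tokens (the constant factor cancels in the ratio):

```python
# Ratio of training compute: GPT-3 vs. a Chinchilla-optimal 1T-param dense model.
gpt3_params, gpt3_tokens = 175e9, 300e9      # GPT-3 paper: 175B params, 300B tokens
target_params, target_tokens = 1e12, 20e12   # ~20 tokens/param (Chinchilla heuristic)

ratio = (target_params / gpt3_params) * (target_tokens / gpt3_tokens)
print(f"~{target_params / gpt3_params:.1f}x params, ~{target_tokens / gpt3_tokens:.0f}x tokens "
      f"-> ~{ratio:.0f}x training FLOPs")
# With the 500B-token figure quoted earlier in the thread, the same ratio is ~230x.
```

Output is roughly "5.7x params, 67x tokens -> 381x training FLOPs", i.e. close to the ~400x figure above.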

Noa Nabeshima is predicting NO at 18%

@Lauro hmmm, I don't have models of compute costs over time, so I'm just throwing thoughts around, but do you think the chip shortage would affect things?
I'm surprised you think the Metaculus market is miscalibrated in a particular direction. I do notice you've been profiting on Manifold. I'm curious about when you expect to beat Metaculus.

Vincent Luczkow is predicting NO at 16%

I can't resolve this myself by training a 1T parameter model really really slowly and then posting about it on my blog, right?

Victor Levoso is predicting YES at 16%

@vluzko You could train it on a single token maybe :b

You would have trouble fitting it in memory, but there are ways to use a hard disk for that.

Vincent Luczkow is predicting NO at 14%

@VictorLevoso I know I can train a 1T parameter model at... *checks disk rates* ...maybe one pass per hour; the question is whether doing it would resolve this market.

Stephen Malina is predicting NO at 34%

@vluzko no 😂

Lauro Langosco is predicting YES at 20%

@StephenMalina hm... how good does a model have to be in order to qualify?

Stephen Malina is predicting NO at 21%
Vincent Luczkow bought Ṁ50 of NO

I'm surprised by how much weight people are still putting on YES. PaLM was 540B parameters, but that was with previous scaling laws. With current best estimates for scaling laws, 1T parameters would be a massive jump in compute over the current largest models. We're only going to see this in 2022 if someone burns millions of dollars training a suboptimal model as a publicity stunt, discovers a breakthrough architecture that scales differently, or decides it's time to go all in on a particular model (and they started training a few months ago).
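For a rough sense of how big that jump is, here's a hedged sketch comparing PaLM's training compute (540B params on roughly 780B tokens, per the PaLM paper) with a Chinchilla-optimal 1T-parameter run; the 6 · N · D approximation is an assumption:

```python
# How big a jump is a Chinchilla-optimal 1T-param run relative to PaLM?
# Approximation: training FLOPs ~ 6 * params * tokens.
palm_flops = 6 * 540e9 * 780e9   # PaLM: 540B params, ~780B tokens
target_flops = 6 * 1e12 * 20e12  # 1T params at ~20 tokens/param

print(f"PaLM: ~{palm_flops:.1e} FLOPs; 1T Chinchilla-optimal: ~{target_flops:.1e} FLOPs")
print(f"-> roughly {target_flops / palm_flops:.0f}x PaLM's training compute")
```

Under these assumptions a compute-optimal 1T-parameter dense model needs several dozen times PaLM's training compute, which is the "massive jump" referred to above.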

Stephen Malina

Really surprised this market hasn't moved at all as the end of 2022 approaches. Maybe people are either not seeing it, or they are holding out for GPT4?

Lauro Langosco is predicting YES at 10%

@StephenMalina presumably many models trained in 2022 will only be known to the public much later. It can take >1y from training to release / publication.

Stephen Malina is predicting NO at 9%

@LauroLangoscodiLangosco I admittedly hadn't considered that. I was implicitly thinking of models we'd know about this year, but I can see how it's ambiguous. That said, do you really think that's why the price was high for so long? Given it eventually dropped, I tend to think it's just that people thought that one might get announced.

Lauro Langosco is predicting YES at 9%

@StephenMalina I imagine it may have been different groups of people interpreting the question different ways, or possibly just not thinking of publishing delays. I do think the probability is low even given that we wouldn't have heard of it if someone had trained a 1T model.

IMO the current description seems pretty unambiguous in that it asks for a model trained before Jan 1st 2023, not published. If you want to resolve differently you probably should update the description (though I'd encourage you to not do that, ideally).

Stephen Malina is predicting NO at 9%

@LauroLangoscodiLangosco yeah, I think the fairer option is for me to update the description to say that I'll wait 6 months after close to resolve, and if we haven't heard of one by then, I'll resolve to "No". Seem fair to you?

Stephen Malina is predicting NO at 9%

@StephenMalina actually 6 months is a long time, thinking 3 now.

Lauro Langosco is predicting YES at 9%

@StephenMalina Personally I don't mind much either way. IMO: if your goal is to resolve markets accurately as per the description, then I think you should wait for at least a year. If you're fine with your markets implicitly being about "what I mean by the description, not what it says", then it seems fine to resolve sooner. If you plan on doing so regularly, it's probably good to include a notice to that effect in your market descriptions.

(fyi regarding your note in the description: my name is Lauro and my pronouns are he/him 🙂 )

Stephen Malina is predicting NO at 21%

> (fyi regarding your note in the description: my name is Lauro and my pronouns are he/him 🙂 )

@Lauro oops, so sorry about that (fixed)!

Stephen Malina is predicting NO at 21%

> then I think you should wait for at least a year.

I want to be as explicit as possible, so I'll wait a year, although it pains me to have to track the market that long. Thank you for pushing back on my unprincipledness.

Vincent Luczkow is predicting NO at 21%

@StephenMalina I don't think you should wait a year; trained models are rarely if ever kept quiet, and it seems kind of pointlessly annoying to drag this one out.

Stephen Malina is predicting NO at 21%

@vluzko thanks, I wish there was a way to create a poll of people who have currently invested in this market on different resolution times...

Stephen Malina is predicting NO at 21%

@vluzko @Lauro thinking about it more, I will stick with 3 months. I have learned a lot from this and other comment threads on this market. Going forward, I will clarify that my markets should be assumed to resolve based on description + title, not just title. While I'd like to promise I'll come up with less ambiguous markets in the future, and will try, I also think part of the value of Manifold is being able to bet on things that aren't perfectly specifiable, so I want to be open that I will continue to create markets where there's some discretion involved. I'll try to be as explicit about this as possible when I do it, though. Thank you both for your comments on this market; I really appreciate them.

Stephen Malina

Given all the confusion in this market and the fact that I'm betting heavily in one direction, I'm a bit concerned that it's going to end up looking like I rugged everyone. As someone who really tries to avoid the appearance of cheating on Manifold, I don't want that, even if I do think I was fairly clear in the original question. Can people give thoughts in the next two days on me resolving this to N/A and creating a new market that differentiates more clearly? I'll make a decision then (Saturday afternoon).

Lawrence Chan is predicting NO at 51%

@StephenMalina I think the question w/ clarification is fine as is.

Stephen Malina

@LawrenceChan thanks, unless an overwhelming majority speaks up I'll keep it as is then.

Gigacasting

Should be clarified to say “non-transformer” (No one even uses the term “Dense” anymore.)

Lawrence Chan is predicting NO at 43%

@Gigacasting No, GPT-3, Gopher, and PaLM are all "dense" transformer models, in that all parts of the model are engaged during a forward pass. In contrast, I think a "sparse" model in this context refers to something like a Mixture-of-Experts or Switch Transformer model, where most of the parameters are inactive for a given forward pass. @StephenMalina please correct me if I'm misunderstanding your intent.

Gigacasting sold Ṁ47 of NO

Terrible poll. Good luck.

Rai

@Gigacasting "Dense transformer" vs. "sparse (MoE) transformer" is terminology that I, as someone working with transformers, am familiar with, though admittedly it could be confused with "fully connected feed-forward networks that are not transformers". The difference between a dense and a sparse transformer is whether the MLP part of the transformer is just a linear layer or a MoE.

Gigacasting

If your “term of art” has existed for less than a couple months, isn’t on the front page of Google, and previously meant something else, you might be doing it wrong.

Stephen Malina

@Gigacasting to be clear, this is my fault, not @Rai's. I apologize and have clarified in the description. Just to be clear, it's not "transformer" vs. not either, though; it's specifically focused on the distinction between mixture-of-experts models (ex: https://arxiv.org/abs/1701.06538), which are sparsely activated, vs. dense models, which use all their nodes for each forward pass.

Lawrence Chan bought Ṁ50 of NO

Replying to Rai's comment: 1) Switch Transformers aren't dense models, and 2) the result was from 2021: https://arxiv.org/abs/2101.03961

Stephen Malina bought Ṁ30 of NO

@LawrenceChan thank you! I can tell I'm gonna get slammed when this resolves but I tried to make it clear that it was about dense models. Appreciate others helping out.

Rai is predicting YES at 38%

@StephenMalina yeah, if you specifically want a dense model, that's way less likely, just because as far as I know dense models are becoming less popular. So it's a bit like asking "will there be a car sold that can do 0-80 in 2 seconds, and is powered by diesel, not electricity?"...

Lawrence Chan is predicting NO at 43%

I mean, PaLM did get released this year, and it has 540b parameters :)

Stephen Malina

@LawrenceChan agree, and @Rai I disagree that that's why it's way less likely. Dense models still seem quite popular amongst GB/OAI/DM given recent news. If anything, the main downwards update was due to Chinchilla, which showed different scaling curves along which we are much further from a compute-optimal model having that many parameters.

Rai is predicting YES at 43%

https://twitter.com/LiamFedus/status/1536791574612303872?s=20&t=CC1hLbIOTqymt4Pm85TlkA "Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!"

Lawrence Chan sold Ṁ21 of NO

Google trained a 540b model (PaLM): https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf

ampdot bought Ṁ20 of NO

Smaller LLMs trained for longer outperform: https://twitter.com/karpathy/status/1509227367302148098

Stephen Malina bought Ṁ1 of NO

@j I don't count it because it's a Mixture of Experts model and the question intentionally specifies "dense language model" as I anticipated this and didn't want to count MoEs.

ampdot bought Ṁ1 of NO

Does the 1.75T-parameter Chinese LLM count? https://www.techradar.com/news/china-outstrips-gpt-3-with-even-more-ambitious-ai-language-model

Stephen Malina bought Ṁ100 of NO

Recent news from DeepMind suggests this would now be a >1 OOM parameter increase from the best model (https://twitter.com/MatthewJBar/status/1509262934639325188?s=20).