There's a news report, paper, or blog post that I consider trustworthy reporting on a 1T parameter dense language model that was trained before January 1st, 2023.
Close date updated to 2022-12-31 11:59 pm
Jun 16, 10:15pm: Since there's been some confusion in the comments, I want to clarify that "dense" here is in contrast to sparse models like mixture-of-experts. I've clarified more in a comment below, from which I've copied the relevant snippet:
> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.
EDIT: In the comments, Lauro Langosco di Langosco pointed out that it's possible we won't know by the end of the year whether someone trained a model with 1T params in 2022 but just hadn't announced it yet. He also noted that the current phrasing of the question is very focused on trained vs. announced. Given that, my compromise is to wait 3 months after the start of 2023 to resolve rather than resolving right away. If no one has announced something they trained with 1T parameters by then, I'll resolve "No".
EDIT 2: Lauro pushed back on three months as being too short, so I'll wait another year to resolve.
EDIT 3: After further deliberation, I've decided to stick with 3 months after 01/01/2023 for resolution. Another clarification: I won't necessarily resolve positively unless the 1T model seems at least competitive with existing SoTA models. As mentioned in the comments, if someone trains a 1T parameter dense model that's only as good as, say, davinci-002, I'll have to decide what to do. I don't think that's likely enough to pre-plan for, though.
Dec 12, 10:40am: Will someone train a 1T parameter dense language model this year? → Will someone train a 1T parameter dense (non-sparsely activated mixture of experts) language model this year?
@StephenMalina I think that given the description it would be better to wait until the 1st of April, but otherwise, I don't have any objections.
@NoaNabeshima if you're asking me for a resolution, it hasn't been three months yet (see description). Will resolve on April 1st.
Does this count? https://arxiv.org/abs/2110.03888
"We demonstrate a practice of pretraining unprecedented 10-trillion-parameter model, an order of magnitude larger than the state-of-the-art, on solely 512 GPUs within 10 days"
@DanElton I don't think so.
1) These seem at least partially like mixture-of-experts models (see the sketch after this comment):
> M6 is built with stacking transformer layers, which includes self attention and feed-forward neural nets (FFN). For the transformation from dense models to sparse expert models, we should only replace FFN layers with the Mixture-of-Expert (MoE) layers. MoE consists of multiple experts, which are usually FFNs distributed on different devices. A gating network decides the dispatching and combining behaviors of each token and thus tokens can be processed in diverse devices. Such mechanism is a combination of data parallelism and expert parallelism, and thus it is highly efficient though with large model capacity. For the training, to realize the learning of both understanding and generation, the model is trained with text denoising and language modeling on plain text data and with image-based text denoising and image captioning on multimodal data. The model is compatible with different types of downstream tasks and can process information of multiple modalities.
2) I specified in this comment that the model had to be competitive, and this model seemingly is not competitive at all.
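For concreteness, here's a minimal sketch of the dense vs. mixture-of-experts distinction the market hinges on (the module names and sizes are made up for illustration, not taken from M6 or any real model): a dense FFN applies all of its parameters to every token, while a top-1 MoE layer routes each token to a single expert, so most of its headline parameter count is untouched on any given forward pass.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Dense: every parameter participates in every forward pass."""
    def __init__(self, d_model=8, d_ff=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class Top1MoEFFN(nn.Module):
    """Sparsely activated: a gate picks one expert per token, so only
    ~1/num_experts of the FFN parameters touch any given token."""
    def __init__(self, d_model=8, d_ff=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [DenseFFN(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        expert_idx = self.gate(x).argmax(dim=-1)  # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```

Under this reading, an M6-style 10T-parameter model built from MoE layers wouldn't count, even though its total parameter count clears 1T.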
@rocketsan I'm curious about your models-- why do you think someone will train a 1T parameter transformer?
@vluzko this isn't in the original description but the spirit of the question was a model that's at least trying to be somewhat better than existing ones.
@vluzko the biggest edge case is if someone trains something that's only at original GPT-3 or davinci-002 level quality with 1T dense parameters. I honestly don't know what I'd do then; I'm kind of just hoping it doesn't happen. If it does, I'll solicit feedback from market participants.
Suppose that GPT-3 -> GPT-4 is a 2 OOM compute increase. If costs scale linearly, that would be $250-500M, which would be an ambitious but not implausible fraction of the OpenAI and DeepMind budgets [DeepMind annual budget ~$1.7B, OpenAI budget ~$0.2-0.4B? less certain about the OpenAI budget than the DeepMind one]. Budget probably isn't the right way of thinking about it, given that DeepMind and OpenAI have their own hardware, but it still gives a sense of the scale. That kind of spending doesn't look like it would quite reach 1T params if the model is Chinchilla-trained (eyeballing the graphs in https://arxiv.org/pdf/2203.15556.pdf). It looks like it might take another OOM to get to 1T, which would be out of OpenAI's budget for GPT-4.
I'm not sure how to think about Google Brain, but I'm imagining that they have within a factor of 2 the budget of DeepMind.
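A rough sanity check of this arithmetic, assuming GPT-3 took about 3.14e23 training FLOPs and cost roughly $2.5-5M (both public estimates, not figures from this thread), and approximating Chinchilla-optimal scaling as D ≈ 20N tokens, so C ≈ 6ND = 120N²:

```python
# Rough sanity check. Assumed (not from the thread): GPT-3 took ~3.14e23
# training FLOPs and cost roughly $2.5-5M; Chinchilla-optimal scaling is
# approximated as D ~ 20*N tokens, so C ~ 6*N*D = 120*N**2.
GPT3_FLOPS = 3.14e23
GPT3_COST_USD = (2.5e6, 5e6)  # rough public estimates, hence the range

def chinchilla_optimal_params(compute_flops: float) -> float:
    """Parameter count that spends the given compute optimally if C ~ 120*N^2."""
    return (compute_flops / 120) ** 0.5

for ooms in (2, 3):
    compute = GPT3_FLOPS * 10**ooms
    low, high = (c * 10**ooms for c in GPT3_COST_USD)
    print(f"+{ooms} OOM: ~{chinchilla_optimal_params(compute):.1e} params, "
          f"~${low / 1e6:.0f}M-${high / 1e6:.0f}M if cost scales linearly")

# Prints ~5.1e11 params (about 0.5T) at +2 OOM and ~1.6e12 (about 1.6T) at
# +3 OOM, i.e. roughly consistent with "another OOM to get to 1T" above.
```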
@L (Though to be clear, that doesn't contradict your point; it just makes your analysis unnecessarily weak evidence for it.)
@NoaNabeshima I don't think that's right: GPT3 was 175b params on 500b tokens. Scaling to 1T params and 20T tokens (Chinchilla-optimal) means roughly 6x params and 40x tokens, which gives ~240x the training FLOPs. That seems not crazy, especially given that compute costs probably scale sublinearly
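As a quick check of that ratio under the standard C ≈ 6*N*D approximation for dense-transformer training compute (using the 500b-token figure from the comment above, which gets corrected to 300b further down):

```python
def train_flops(params: float, tokens: float) -> float:
    """Standard dense-transformer approximation: C ~ 6 * N * D."""
    return 6 * params * tokens

gpt3 = train_flops(175e9, 500e9)   # GPT-3 with the 500b-token figure above
one_t = train_flops(1e12, 20e12)   # 1T params on 20T tokens (Chinchilla-optimal)
print(one_t / gpt3)                # ~229, i.e. the "~240x" rough figure
```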
@L I thought GPT-3 would be considered dense for the purposes of this market
> It's not "transformer" vs. not either though, it's specifically focused on the distinction between mixture of experts model (ex: https://arxiv.org/abs/1701.06538) which are sparsely activated vs. dense models which use all their nodes for each forward pass.
@Lauro Metaculus predicts GPT-4 will use 14.4x the FLOPs of GPT-3
https://lambdalabs.com/blog/demystifying-gpt-3
https://www.metaculus.com/questions/9519/flops-used-for-gpt-4-if-released/
@vluzko I agree that it isn't dense in that sense, but my guess is that GPT-3 is dense in the spirit of the market. @StephenMalina ?
@Lauro I think my math might be wrong, but the Metaculus community seems to predict a ~3% chance of GPT-4 using that many FLOPs
@LostFutures It is dense in the sense of this market - it isn't a mixture of experts model where only a fraction of the parameters are actually used in any forward/backward pass.
But it "isn't dense" - or rather it is sparse - in that some of the attention layers are locally banded.
@NoaNabeshima oh good point! So if GPT3 was trained on 300b tokens, that would make it roughly 6x params and 66x tokens, which gives ~400x the FLOPs.
Re sublinear: I expect compute costs to have gone down somewhat since GPT3 was trained, and I also expect standard economies of scale to help a little. I don't have a good sense for how much costs go down - I guess I'd expect compute costs to be something like 30%-80% of the GPT3 per FLOP cost? Don't take this estimate seriously though.
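Combining the corrected ratio with that per-FLOP cost guess (the $2.5-5M GPT-3 training cost below is a rough public estimate, not a number from this thread):

```python
# Corrected ratio: 1T params on 20T tokens vs. GPT-3's 175b params on 300b tokens.
flop_ratio = (6 * 1e12 * 20e12) / (6 * 175e9 * 300e9)  # ~381x (the "~400x" above)

gpt3_cost_usd = (2.5e6, 5e6)   # rough public estimate of GPT-3's training cost
per_flop_factor = (0.3, 0.8)   # the 30%-80% per-FLOP cost guess above

low = gpt3_cost_usd[0] * flop_ratio * per_flop_factor[0]
high = gpt3_cost_usd[1] * flop_ratio * per_flop_factor[1]
print(f"~${low / 1e6:.0f}M to ~${high / 1e9:.1f}B")   # roughly $290M to $1.5B
```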
FWIW I think the Metaculus market is miscalibrated. I'd put the probability at maybe ~30%.
@Lauro hmmm, I don't have models of compute costs over time, so I'm just throwing thoughts around, but do you think the chip shortage would affect things?
I'm surprised you think the Metaculus market is miscalibrated in a particular direction. I do notice you've been profiting on Manifold. I'm curious about when you expect to beat Metaculus.