
Conditional on being a transformer.
Nov 25, 11:11pm: Will GPT-4 be a dense model? → GPT-4 #5: Will GPT-4 be a dense model?
@HoraceHe somebody sold "YES" shares. Also, some people bought "NO".
@HoraceHe Personally, I want the liquidity for March 31st resolutions in other markets.
@vluzko "dense" here refers the standard machine learning usage of mostly non-zero parameters, right?

Meta's future looks grim,
As layoffs start to skim.
Numbers dwindling, profits slim,
More layoffs seem quite prim.
Conspiracy theory: OpenAI may be choosing not to reveal GPT-4's architecture not just because of the ever-increasing safety and profit stakes, but because there was some major architectural shift. Given that none of the recent innovations seem super promising (e.g., autoregressive diffusion models), this shift may have been to something already well-established (but not frequently used in LLMs), like sparse encoding or mixture of experts.

It seems very unlikely that the release of GPT-4 will resolve this. However, the market does not close until 2027, and I will leave it open until then in case the information is either (credibly) leaked or OpenAI decides to release it after the fact. If neither of those occurs, the market will resolve N/A.
I don't see architecture details in the blog post or the technical report.

@wadimiusz "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

Thank you guys for continuing to create arb opportunities by making the probabilities of this question and the mixture-of-experts question sum to more than 100%.
@AlexAmadori Given they have different end-dates, you may find yourself with delicate hedging decisions to make before July.
I'm confused why the market expects GPT-4 to be a Mixture of Experts model. I think it's ~85% likely that GPT-4 will be a dense model.
As far as I know, OpenAI has not described training large-scale MoE models. If OpenAI were to make a huge bet on MoE, we should expect them to first de-risk it by training and validating smaller MoE models before spending millions of dollars' worth of compute on a MoE GPT-4.
All prior GPT models (1 through 3.5) were dense. OpenAI is bullish on scaling dense LLMs and has not, as far as I know, indicated an intention to abandon that strategy.
Dense models are the most common type of model among the largest-scale experiments. As far as I can tell, the largest-scale MoE LLM (by compute) was Switch Transformers, which at ~2.8e22 FLOP is quite small compared to the largest models trained by Google, Meta, Microsoft, and so on. Since GPT-4 is likely to involve a lot of compute, we should expect it to be similar in many respects to PaLM, Megatron-Turing NLG, OPT-175B, Gopher, Chinchilla, LaMDA, Bloom, etc.
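Back-of-the-envelope on that "quite small" claim, using the standard C ≈ 6·N·D approximation for dense-model training compute (the GPT-3 figures below are the commonly cited 175B parameters and ~300B tokens; the Switch Transformers figure is the estimate quoted above):

```python
# Rough training-compute comparison using C ≈ 6 * N (params) * D (tokens).
gpt3_flop = 6 * 175e9 * 300e9      # ≈ 3.15e23 FLOP for GPT-3
switch_flop = 2.8e22               # cited estimate for Switch Transformers

print(f"GPT-3:  {gpt3_flop:.2e} FLOP")
print(f"Switch: {switch_flop:.2e} FLOP")
print(f"Ratio:  {gpt3_flop / switch_flop:.1f}x")   # ≈ 11x, before even counting post-GPT-3 models
```

So the largest MoE run on record was roughly an order of magnitude below GPT-3, never mind whatever compute GPT-4 used.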

@TamayBesiroglu I think the market is being driven by people who have (1) heard rumors that GPT-4 is 10T parameters or whatever, (2) also heard that 10T-parameter dense models are totally bonkers, and (3) concluded that it must be MoE.
Before the 3.5 release it was a lot more reasonable to guess that maybe it would be MoE, but if they were switching to MoE why would they burn a ton of money training a dense 3.5? So really the question is whether OpenAI has learned in the last year that MoE is better, which seems pretty unlikely.
@vluzko Fair. A while back, I asked Sam Altman how many parameters GPT-4 would have, and he said roughly that it wouldn't be much larger than GPT-3.


@quadrilateral we have every reason to believe that 3.5 is 3 trained with closer to optimal tokens
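To put a rough number on "closer to optimal tokens": under the Chinchilla heuristic of ~20 training tokens per parameter (Hoffmann et al. 2022), a GPT-3-scale dense model would want far more data than GPT-3 was reported to get. A quick sketch (the 175B/300B figures are the commonly cited GPT-3 numbers, not anything OpenAI has confirmed about 3.5):

```python
# Chinchilla-style "compute-optimal tokens" check: ~20 tokens per parameter.
params = 175e9                     # GPT-3-scale parameter count
gpt3_tokens = 300e9                # tokens GPT-3 was reported to be trained on
optimal_tokens = 20 * params       # ≈ 3.5e12 tokens

print(f"Chinchilla-optimal tokens: {optimal_tokens:.1e}")
print(f"GPT-3 training tokens:     {gpt3_tokens:.1e}")
print(f"Shortfall factor:          {optimal_tokens / gpt3_tokens:.0f}x")  # ≈ 12x
```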

I'm betting YES partially to hedge against another market I have a large position in that is closed.
https://manifold.markets/StephenMalina/will-someone-train-a-1t-parameter-d


What if the model has "global memory" akin to the one described in the RETRO paper? It isn't triggered on every token's inference, so the whole network wouldn't be run for every token. Would it then be considered sparse?
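To make the ambiguity concrete, here is a toy illustration (not RETRO's actual architecture) of a memory module whose parameters are only exercised once a sequence is long enough, so a short forward pass never touches them:

```python
import torch
import torch.nn as nn

class ChunkedMemoryBlock(nn.Module):
    """Toy sketch: cross-attention 'memory' parameters that are skipped
    entirely for inputs shorter than one chunk."""

    def __init__(self, d_model: int, chunk_size: int = 64):
        super().__init__()
        self.chunk_size = chunk_size
        self.memory_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, x: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); retrieved: (batch, n_neighbors, d_model)
        if x.size(1) < self.chunk_size:
            # Sequence too short: the memory parameters are never used this pass.
            return x
        out, _ = self.memory_attn(x, retrieved, retrieved)
        return x + out
```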

If GPT-4 has a parameter that is always used for sufficiently long sequence lengths but not used on shorter sequence lengths, is it considered dense?
@NoaNabeshima oh, I guess this is trivially the case for position embeddings
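Concretely, with learned absolute position embeddings a short input only ever reads a prefix of the embedding table (toy example assuming GPT-style learned embeddings):

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 2048, 768
pos_emb = nn.Embedding(max_seq_len, d_model)    # learned position embeddings

seq_len = 128                                   # a short input
positions = torch.arange(seq_len)
used = pos_emb(positions)                       # only rows 0..127 are read;
                                                # rows 128..2047 sit unused this pass
```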
Does GPT-3 count as a dense model? Or are you asking if all parameters are necessarily involved in computing a forward pass w/ batch size 1?

@NoaNabeshima Yeah GPT-3 counts as dense. This is about whether it will be a mixture of experts.
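For anyone unfamiliar with the distinction being resolved on, here is a minimal sketch contrasting a dense feed-forward block with a top-1-routed mixture-of-experts block (PyTorch-style toy code, not anything OpenAI has published):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Dense block: every parameter participates in every token's forward pass."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.gelu(self.w1(x)))

class MoEFFN(nn.Module):
    """Mixture-of-experts block: a router sends each token to one expert,
    so most expert parameters go untouched for any given token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([DenseFFN(d_model, d_ff) for _ in range(n_experts)])

    def forward(self, x):
        # x: (n_tokens, d_model); top-1 routing as in Switch Transformers.
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```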


Reddit comment screenshot. Credit: Igor Baikov (shared by Gwern)









