Is GPT-4 a mixture of experts?
Resolved YES (Mar 19)

MoE


🏅 Top traders

# | Name | Total profit
1 |      | Ṁ1,317
2 |      | Ṁ721
3 |      | Ṁ551
4 |      | Ṁ550
5 |      | Ṁ265

resolved YES based on this for anyone wondering

bought Ṁ100 of YES

I think there might be a hint here that GPT-4 uses MoE

predicted YES

I've reopened this market for trading because there has been no official confirmation that GPT-4 is a Mixture of Experts.

How is it that this market isn't resolved YES? There seems to be a broad consensus on the subject

predicted YES

@MP Hasn't been confirmed by OpenAI yet. It could be a psy-op to mislead the other AI labs into pursuing dead-end research.

predicted NO

@MP "broad consensus" is an overexaggeration. there's been some leaks

@ShadowyZephyr per SemiAnalysis:

LLM inference in most current use cases is to operate as a live assistant, meaning it must achieve throughput that is high enough that users can actually use it. Humans on average read at ~250 words per minute but some reach as high as ~1,000 words per minute. This means you need to output at least 8.33 tokens per second, but more like 33.33 tokens per second to cover all corner cases.
A trillion-parameter dense model mathematically cannot achieve this throughput on even the newest Nvidia H100 GPU servers due to memory bandwidth requirements. Every generated token requires every parameter to be loaded onto the chip from memory. That generated token is then fed into the prompt and the next token is generated. Furthermore, additional bandwidth is required for streaming in the KV cache for the attention mechanism.
The chart [in the SemiAnalysis article] demonstrates the memory bandwidth required to inference an LLM at high enough throughput to serve an individual user. It shows that even 8x H100 cannot serve a 1 trillion parameter dense model at 33.33 tokens per second. Furthermore, the FLOPS utilization rate of the 8x H100s at 20 tokens per second would still be under 5%, resulting in horribly high inference costs. Effectively there is an inference constraint around ~300 billion feed-forward parameters for an 8-way tensor parallel H100 system today.
Yet OpenAI is achieving human reading speed, with A100s, with a model larger than 1 trillion parameters, and they are offering it broadly at a low price of only $0.06 per 1,000 tokens. That's because it is sparse, i.e. not every parameter is used.

SemiAnalysis is a very reputable source, and if they are saying it's impossible to serve a 1T dense model at that speed (they say they talked with many people at many labs, including OpenAI), it's safe to say it's an MoE
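
A quick back-of-the-envelope check of the bandwidth argument quoted above, as a Python sketch. The 1T parameter count and 33.33 tokens/second come from the quote; the fp16 weights and per-GPU HBM bandwidth are approximate assumptions for illustration, and nothing here is confirmed about GPT-4:

```python
# Back-of-the-envelope check of the SemiAnalysis argument (assumed figures).
# For a dense decoder, every generated token streams every parameter from
# memory, so required bandwidth ~= params * bytes_per_param * tokens_per_sec.

PARAMS = 1.0e12           # assumed dense model size: 1 trillion parameters
BYTES_PER_PARAM = 2       # fp16/bf16 weights
TARGET_TOK_PER_S = 33.33  # "fast reader" target from the quote above

H100_HBM_BW = 3.35e12     # approx. HBM bandwidth per H100 SXM, bytes/s
NUM_GPUS = 8              # 8-way tensor-parallel server

required_bw = PARAMS * BYTES_PER_PARAM * TARGET_TOK_PER_S  # bytes/s
available_bw = H100_HBM_BW * NUM_GPUS                      # bytes/s

print(f"required : {required_bw / 1e12:.1f} TB/s")   # ~66.7 TB/s
print(f"available: {available_bw / 1e12:.1f} TB/s")  # ~26.8 TB/s
print(f"max dense tokens/s: {available_bw / (PARAMS * BYTES_PER_PARAM):.1f}")
```

Even before counting KV-cache traffic, the weights alone would need roughly 2-3x the aggregate memory bandwidth of the server, which is the core of the "it must be sparse" argument.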

Seems like there isn't enough evidence yet for this to resolve. Any OpenAI insiders want to let us know? :D

@chilli That is bullshit. Whole-model routing is not how MoE works

bought Ṁ1 of NO

Symbolic bet, because I don't think GPT-4 is MoE, but I think it's likely this will not resolve for a long time if ever.

Edit: People are saying it's MoE, interesting that OpenAI is still having issues with scaling lol

bought Ṁ100 of NO

YES appears overvalued here. The GPT-4 dense model market (which has 3x the volume of this one) is at 33% right now. MoE models are not dense, so the two are mutually exclusive. However, there are more sparse model types than just mixture of experts. I don't believe this warrants a 91% (61/67) certainty that this sparse model turns out to be MoE, especially given the poor performance of these types of models in the past.
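
For what it's worth, here is where the 91% (61/67) figure comes from, using the two market prices quoted above and treating "dense" and "MoE" as mutually exclusive, as the comment does:

```python
# How the "91% (61/67)" figure follows from the two market prices.
p_dense = 0.33           # "GPT-4 is dense" market price quoted above
p_sparse = 1 - p_dense   # dense and MoE are treated as mutually exclusive
p_moe = 0.61             # this market's YES price at the time of the comment

# Implied probability that GPT-4 is MoE *given* that it is sparse:
p_moe_given_sparse = p_moe / p_sparse
print(f"{p_moe_given_sparse:.0%}")  # ~91%
```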

What if the different personas a dense LLM has are each an expert, do we call it MoE too xD /s

sold Ṁ251 of YES

If the 1T parameter article is correct, GPT-4 inference seems too slow for it to be a 1T parameter mixture of experts model

bought Ṁ95 of YES

@NoaNabeshima Notably ChatGPT inference seems a lot slower than Bing inference. That could just be because Bing uses smaller models sometimes.

Someone at Anthropic mentioned that OpenAI might be throttling gpt-3.5-turbo.

Once you get the prompt hidden states, you want to use those for generation as quickly as possible on the same GPU/cluster of GPUs, right? Then if too many people are trying to generate simultaneously, I'm imagining that would just lend itself to longer lag at the start of generation but not lag in next-token streaming. It's possible OpenAI actually generates the next tokens quickly and then sends them to the user slowly (so that the user queries the API at a slower rate).

predicted YES

@NoaNabeshima Using GPT-4 via playground seems way faster than ChatGPT Plus right now.

bought Ṁ0 of YES

It's definitely an ensemble model, but I don't think it's a mixture of experts (i.e., I believe it consistently accesses all subnetworks without gating).
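
To make the gating distinction concrete: in an MoE layer a learned router activates only a few experts per token, whereas an ungated ensemble evaluates every subnetwork for every token. A toy PyTorch sketch of top-k gating follows; it is purely illustrative and says nothing about GPT-4's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: a router picks k experts per token."""
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert per token,
        # but only the top-k experts are actually evaluated -- this is the
        # "gating" that an ordinary ensemble lacks.
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

An ensemble without gating would instead run all the experts on every token and combine their outputs, so it spends full compute per token; the routing is what makes an MoE cheap to run relative to its total parameter count.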

From the GPT-4 Technical Report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
