Mixture-of-Depths:
https://arxiv.org/abs/2404.02258
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth.
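The core mechanism is a per-layer learned router: it scores every token, only a top-k fraction of tokens per sequence are run through the block's attention/MLP, and the rest skip the layer via the residual stream. Below is a rough sketch of that routing idea, not the paper's code; the `MoDLayer` class, the `capacity` fraction, and the assumption that `block` returns a residual update (no internal skip connection) are all illustrative.

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Sketch of a Mixture-of-Depths-style layer (illustrative, not the paper's
    implementation): a learned router scores every token, only the top-k tokens
    per sequence are run through the block, and the rest skip it entirely via
    the residual stream."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        # `block` is assumed to return a residual update (e.g. attention + MLP
        # without its own skip). `capacity` is the fraction of tokens processed.
        self.router = nn.Linear(d_model, 1)
        self.block = block
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(s * self.capacity))

        scores = self.router(x).squeeze(-1)            # (batch, seq_len)
        top = scores.topk(k, dim=-1).indices           # positions this layer computes on
        idx = top.unsqueeze(-1).expand(-1, -1, d)      # (batch, k, d_model)

        selected = x.gather(1, idx)                    # tokens routed into the block
        # Scaling by the gated router score keeps the router on the gradient
        # path, so it can learn which tokens are worth the FLOPs.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        update = self.block(selected) * gate           # (batch, k, d_model)

        # Add the update back at the selected positions; unselected tokens pass
        # through unchanged, i.e. this layer spends no compute on them.
        return x.scatter_add(1, idx, update)
```

Note this only covers the training-time picture: a sequence-level top-k is not causal, so autoregressive sampling needs extra machinery (the paper discusses how to handle this).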
See also:
I think this is a fairly standard technique that has been around for quite a while. Whether OpenAI reveals they are using some form of this is another question...
If the question is whether they will take the paper and implement something directly off of it, no, that's very unlikely.
It's worth noting that quite a lot of the stuff on arxiv is just someone trying to take credit for stuff already out there.
@gpt_news_headlines the paper was submitted on April 2nd, 2024. My bad if I'm wrong, but you might be talking about Mixture-of-Experts (MoE)?
@Bayesian No, mixture of depths. Training / evaluating at different layers. We were doing this last year. And nobody in our group felt we were doing anything particularly novel.
@gpt_news_headlines interesting. Do you know where i might find more info about that? I can't find mentions of mixture of dephts in the context of LLMs/transformers from before april
@Bayesian https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=evaluating+transformers+at+different+depths
To be fair, the paper you referenced above has some specific ideas around this. I guess I would just be rather surprised to see OpenAI say "oh yeah, we implemented the MoD paper by DeepMind/McGill in our gpt6".
But for them to mention they are doing things at different depths, yes, that would not be surprising. In fact, they may have already mentioned this about gpt4.
@gpt_news_headlines Ok, yeah, to be clear I'm referring to the first thing essentially: it doesn't need to be an implementation of that paper, but it needs to be something that at least works off of that paper and/or is referred to as Mixture-of-Depths.
@Bayesian So, either a direct citation of the paper or the use of the term "Mixture of Depths"? How about "we've been doing something similar to mixture of depths since before it came out"?