Mixture-of-Depths:
https://arxiv.org/abs/2404.02258
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth.
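The core mechanism is a per-layer learned router: it scores every token, only a top-k fraction of tokens per sequence are run through the block's attention/MLP, and the rest skip the layer via the residual stream. Below is a rough sketch of that routing idea, not the paper's code; the `MoDLayer` class, the `capacity` fraction, and the assumption that `block` returns a residual update (no internal skip connection) are all illustrative.

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Sketch of a Mixture-of-Depths-style layer (illustrative, not the paper's
    implementation): a learned router scores every token, only the top-k tokens
    per sequence are run through the block, and the rest skip it entirely via
    the residual stream."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        # `block` is assumed to return a residual update (e.g. attention + MLP
        # without its own skip). `capacity` is the fraction of tokens processed.
        self.router = nn.Linear(d_model, 1)
        self.block = block
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(s * self.capacity))

        scores = self.router(x).squeeze(-1)            # (batch, seq_len)
        top = scores.topk(k, dim=-1).indices           # positions this layer computes on
        idx = top.unsqueeze(-1).expand(-1, -1, d)      # (batch, k, d_model)

        selected = x.gather(1, idx)                    # tokens routed into the block
        # Scaling by the gated router score keeps the router on the gradient
        # path, so it can learn which tokens are worth the FLOPs.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        update = self.block(selected) * gate           # (batch, k, d_model)

        # Add the update back at the selected positions; unselected tokens pass
        # through unchanged, i.e. this layer spends no compute on them.
        return x.scatter_add(1, idx, update)
```

Note this only covers the training-time picture: a sequence-level top-k is not causal, so autoregressive sampling needs extra machinery (the paper discusses how to handle this).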
See also:
I think this is a fairly standard technique that has been around for quite a while. Whether OpenAI reveals they are using some form of this is another question...
If the question is whether they will take the paper and implement something directly off of it, no, that's very unlikely.
It's worth noting that quite a lot of the stuff on arxiv is just someone trying to take credit for stuff already out there.
@gpt_news_headlines the paper was submitted on April 2nd, 2024. My bad if I'm wrong, but you might be talking about Mixture-of-Experts (MoE)?
@Bayesian No, mixture of depths. Training / evaluating at different layers. We were doing this last year. And nobody in our group felt we were doing anything particularly novel.
@gpt_news_headlines interesting. Do you know where i might find more info about that? I can't find mentions of mixture of dephts in the context of LLMs/transformers from before april
@Bayesian https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=evaluating+transformers+at+different+depths
To be fair, the paper you referenced above has some specific ideas around this. I guess I would just be rather surprised to see OpenAI say "oh yeah, we implemented the MoD paper by DeepMind/McGill in our gpt6".
But for them to mention they are doing things at different depths, yes, that would not be surprising. In fact, they may have already mentioned this about gpt4.
@gpt_news_headlines Ok, yeah, to be clear I'm referring to the first thing essentially: it doesn't need to be an implementation of that paper, but it needs to be something that at least works off of that paper and/or is referred to as Mixture-of-Depths.
@Bayesian So, either a direct citation of the paper or the use of the term "Mixture of Depths"? How about "we've been doing something similar to mixture of depths since before it came out"?