See this blog post: https://www.evanmiller.org/attention-is-off-by-one.html, and in particular this paragraph:
> Even though softmax1 is facially quite boring, I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research. If you want to run some experiments and prove me right, DM me on Twitter and we’ll get a paper going.
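For context, the blog post's softmax1 adds 1 to the softmax denominator (equivalently, an implicit extra logit fixed at 0), so the attention weights can sum to less than 1 and a head can "abstain." A minimal numerical sketch (not from the post; function names are my own):

```python
import numpy as np

def softmax(x):
    # standard softmax: weights always sum to exactly 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax1(x):
    # softmax1 from the blog post: exp(x_i) / (1 + sum_j exp(x_j)),
    # computed with max-subtraction for numerical stability
    m = np.max(x)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

x = np.array([-4.0, -4.0, -4.0])  # head doesn't want to attend to anything
print(softmax(x).sum())   # exactly 1.0: forced to attend somewhere
print(softmax1(x).sum())  # ~0.05: the head can mostly abstain
```

With all-negative logits, standard softmax still distributes a full unit of attention, while softmax1's total mass collapses toward zero.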
🏅 Top traders

# | Name | Total profit
---|---|---
1 | | Ṁ25
2 | | Ṁ13
3 | | Ṁ5
4 | | Ṁ4
5 | | Ṁ3
@NoaNabeshima This strongly suggests that the outlier channels are at least partially because of the softmax, especially when softmax wants to attend to a token with 0 probability (!)

You might imagine that softmax_1 doesn't solve this, because it only helps the head not attend much to anything overall, but doesn't let the head easily attend a truly zero amount to any particular token.
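A toy illustration of the "truly zero" point above (my own sketch, not from the thread): because exp() never reaches zero, standard softmax can only drive a token's weight toward zero by opening a large logit gap, which is exactly the kind of extreme pre-softmax value implicated in outlier channels.

```python
import numpy as np

def softmax(x):
    # standard softmax with max-subtraction for stability
    e = np.exp(x - x.max())
    return e / e.sum()

# To nearly ignore the second token, the first token's logit must
# dominate by a large margin; the weight shrinks but never hits 0.
for gap in [5.0, 10.0, 20.0]:
    w = softmax(np.array([gap, 0.0]))
    print(gap, w[1])
```

Each tenfold reduction in the "ignored" token's weight costs roughly ln(10) ≈ 2.3 extra units of logit gap, so suppressing a token to near-zero attention requires ever-larger activations.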
@marketwise Wait. I accidentally subsidized (Ṁ2500) and boosted (Ṁ2500) this "control group" question instead of the other one. 🤣 Enjoy the liquidity!
It resolves as YES if someone runs experiments showing that this modified softmax solves the outlier-features issue with quantization.
I haven't thought quantitatively about what would count as "solving the outlier features problem" yet. What I have in mind is, results on par with these fixes. I'm open to suggestions of more concrete criteria.
Currently I would resolve as NO: all I could find during a quick search was this post, which I don't think presents good enough evidence.