
Will replacing the LayerNorm variance and expectation with fixed values that don't depend on the current hidden state remove the outlier features in the model's hidden states? Resolution requires that the replacement not significantly degrade the final loss.
If outlier channels still exist but their mean absolute value is at least halved, this resolves at 80%.
Edit: if I conclude that outlier channels are just the tail of some smooth distribution, I can still resolve Yes/No if the tail gets squashed to a much lower magnitude.
If I end up concluding there are no outlier channels, this resolves N/A.
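For concreteness, here is a minimal sketch (PyTorch; the class name, the fixed statistics, and the defaults are my own illustrative choices, not the author's setup) of the kind of replacement the question describes: a LayerNorm whose mean and variance are constants rather than computed from the current hidden state.

```python
import torch
import torch.nn as nn

class FrozenStatsLayerNorm(nn.Module):
    """LayerNorm variant whose mean/variance are fixed constants
    (hypothetical sketch; in practice the constants might come from
    running averages over a calibration set)."""

    def __init__(self, d_model: int, fixed_mean: float = 0.0,
                 fixed_var: float = 1.0, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        # Fixed statistics: these do NOT depend on the current hidden state.
        self.register_buffer("mean", torch.tensor(fixed_mean))
        self.register_buffer("var", torch.tensor(fixed_var))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard LayerNorm would use x.mean(-1) and x.var(-1) here.
        x_hat = (x - self.mean) / torch.sqrt(self.var + self.eps)
        return self.gamma * x_hat + self.beta
```

The question is then whether a model trained with something like this keeps roughly the same loss while the outlier channels shrink.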
@EvanDaniel I might still have an answer later on. If this is especially dissatisfying to traders, I could resolve N/A or extend the market close.
This work seems like fairly strong evidence against: https://transformer-circuits.pub/2023/privileged-basis/index.html. It suggests the outliers are predominantly due to Adam, though they don't explicitly replace LN with e.g. BatchNorm; they just simplify it to avoid engaging with the residual stream basis. One test would be looking for outliers in e.g. the Q/K/V vectors of a head. I predict that these exist, even though LayerNorm never sees that basis.
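A rough sketch of that check, assuming TransformerLens-style activation caching; the model, prompt, head, and the max-vs-median outlier score below are all illustrative choices, not anything from the linked work:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model choice
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

layer, head = 0, 0  # arbitrary head to inspect
for name in ["q", "k", "v"]:
    acts = cache[name, layer][0, :, head, :]  # [seq_pos, d_head]
    per_channel = acts.abs().mean(dim=0)      # mean |value| per channel
    # Crude outlier score: how far the biggest channel sits above the median.
    ratio = (per_channel.max() / per_channel.median()).item()
    print(f"{name}: max/median mean-abs channel ratio = {ratio:.1f}")
```

If big per-channel outliers show up in Q/K/V, a basis LayerNorm never sees, that would point toward the Adam explanation rather than the LayerNorm one.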
@NeelNanda The hypothesis I had was that LayerNorm pushes the model to add a large mean direction/bias to the embeddings (to influence the denominator of the LayerNorm), while not ruling out the possibility that Adam makes this bias/direction basis-aligned.
This is different from the hypothesis that the LayerNorm gamma makes things channel-aligned.
In that paper, they still use the RMS in the denominator in their LayerNorm test.
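A tiny numerical illustration of that hypothesis (made-up dimensions and magnitudes, just to show the mechanism): a large fixed direction added to the residual stream inflates the RMS denominator, shrinking the rest of the signal after normalization.

```python
import torch

torch.manual_seed(0)
d = 768
x = torch.randn(d)                          # "signal" part of a residual stream vector
direction = torch.randn(d)
bias = 50.0 * direction / direction.norm()  # large fixed direction, independent of x
# (Per the hypothesis, Adam might additionally make this direction basis-aligned.)

def rms(v, eps=1e-5):
    return torch.sqrt((v ** 2).mean() + eps)

# The fixed direction inflates the denominator...
print(rms(x).item(), rms(x + bias).item())
# ...so after normalization the signal components come out smaller.
print((x / rms(x)).abs().mean().item(),
      ((x + bias) / rms(x + bias)).abs().mean().item())
```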
@NoaNabeshima I'm confused - the only difference between RMSNorm and LayerNorm is that LayerNorm subtracts the mean, which removes the direction [1, 1, 1, ...].
While this may create artifacts, they wouldn't show up in any single channel.
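For reference, a small sketch of that point: subtracting the per-token mean is exactly a projection removing the [1, 1, 1, ...] direction, and the component it removes is identical in every channel, so it can't concentrate in any single one (toy vector below, purely illustrative).

```python
import torch

torch.manual_seed(0)
d = 768
x = torch.randn(d) + 3.0 * torch.ones(d)   # vector with a big all-ones component

# Mean subtraction == projecting out the normalized all-ones direction.
ones_hat = torch.ones(d) / d ** 0.5
centered = x - x.mean()
projected = x - (x @ ones_hat) * ones_hat
print(torch.allclose(centered, projected, atol=1e-5))  # True

# The removed component is the same in every channel, so it can't create
# a single-channel outlier on its own.
removed = x - centered
print(removed.min().item(), removed.max().item())  # both equal x.mean()
```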