Will replacing LayerNorm with something that doesn't use current vector statistics remove outlier channels?
27% chance

Will replacing LayerNorm variance and expectation with some other numbers that don't depend on the current hidden state remove outlier features in the model hidden states? Requires the replacement to not significantly degrade the final loss.

If outlier channels still exist but are at least halved in mean of absolute value, resolves at 80%.

Edit: if I end up thinking outlier channels are just the tail of some smooth distribution, this can still resolve Yes/No depending on whether that tail gets squashed to much lower magnitude.

If I end up not thinking there are outlier channels, resolves N/A.
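For concreteness, here is a minimal PyTorch-style sketch of the kind of replacement the question describes: a LayerNorm variant whose mean and variance are fixed buffers rather than statistics of the current hidden state. The module and buffer names are hypothetical, and how the fixed statistics would be chosen (constants, frozen running averages, etc.) is left open.

```python
import torch
import torch.nn as nn

class FixedStatsNorm(nn.Module):
    """LayerNorm-like module whose mean/variance are fixed buffers rather
    than statistics of the current hidden state (hypothetical sketch)."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable affine parameters, as in standard LayerNorm.
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        # Fixed statistics: could be constants, or averages frozen
        # independently of the current forward pass.
        self.register_buffer("fixed_mean", torch.zeros(1))
        self.register_buffer("fixed_var", torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize with the stored statistics instead of x.mean()/x.var().
        x_hat = (x - self.fixed_mean) / torch.sqrt(self.fixed_var + self.eps)
        return self.gamma * x_hat + self.beta
```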


@traders anyone know how this should resolve? Or how to contact the creator for a resolution?

@EvanDaniel Hi, I can be contacted. I did not reach a conclusion.

@EvanDaniel I still might have an answer later on. If this is esp. dissatisfying to traders I could resolve N/A or extend market close.

@NoaNabeshima Did you reach a conclusion?

@EvanDaniel No, sorry for not responding earlier.

Girl if you've solved this then you've solved a much, much bigger problem.

This work seems like fairly strong evidence against it: https://transformer-circuits.pub/2023/privileged-basis/index.html. It suggests the cause is predominantly Adam, though they don't explicitly replace LN with e.g. batchnorm; they just simplify it to avoid engaging with the residual stream basis. One test would be looking for outliers in e.g. the Q/K/V vectors of a head - I predict that they exist, and LayerNorm never sees that basis.
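One way the per-head test suggested above could be operationalized is sketched below. The threshold and the use of per-channel mean absolute value and excess kurtosis as the outlier measure are my assumptions, not a canonical definition, and the activation tensor would come from wherever you cache Q/K/V activations (e.g. a TransformerLens-style activation cache).

```python
import torch

def outlier_channel_stats(acts: torch.Tensor, z_thresh: float = 6.0):
    """Flag outlier channels in an activation tensor of shape [..., d_model]
    (or [..., d_head] for per-head Q/K/V vectors).

    A channel is flagged if its mean absolute value is more than `z_thresh`
    standard deviations above the across-channel average -- one simple
    operationalization of "outlier channel".
    """
    flat = acts.reshape(-1, acts.shape[-1]).float()
    mean_abs = flat.abs().mean(dim=0)                 # per-channel scale
    z = (mean_abs - mean_abs.mean()) / mean_abs.std()
    # Excess kurtosis of the per-channel scales: heavy tails suggest a
    # privileged (basis-aligned) set of outlier channels.
    centered = mean_abs - mean_abs.mean()
    kurtosis = (centered**4).mean() / (centered**2).mean() ** 2 - 3.0
    return {
        "outlier_channels": torch.nonzero(z > z_thresh).flatten(),
        "mean_abs": mean_abs,
        "kurtosis": kurtosis.item(),
    }
```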

@NeelNanda The hypothesis I had was that layernorm adds a large mean direction/bias in the embeddings (to influence the denominator of the layernorm), not ruling out the possibility that Adam makes this bias/direction basis-aligned.

It's different from the hypothesis that the layernorm gamma makes things channel-aligned.

In that paper they still use RMS in the denominator in their layernorm test.

@NoaNabeshima I'm confused - the reason RMSNorm is better than LayerNorm is that subtracting the mean removes the direction [1, 1, 1, ...].

While this may create artifacts, they wouldn't show up in any single channel.
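For reference, a minimal sketch of the two normalizations being contrasted here (learned scale and bias omitted): LayerNorm subtracts the per-token mean, which projects out the [1, 1, ..., 1] direction, while RMSNorm only rescales by the root-mean-square.

```python
import torch

def layernorm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Subtracting the mean projects out the [1, 1, ..., 1] direction,
    # then the result is scaled to unit variance.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

def rmsnorm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # No mean subtraction: the [1, 1, ..., 1] component is preserved,
    # only the overall scale is normalized.
    rms = torch.sqrt((x**2).mean(dim=-1, keepdim=True) + eps)
    return x / rms
```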
