Will replacing LayerNorm variance and expectation with some other numbers that don't depend on the current hidden state remove outlier features in the model hidden states? Requires the replacement to not significantly degrade the final loss.

If outlier channels still exist but are at least halved in mean of absolute value, resolves at 80%.

Edit: if I think outlier channels are the tail of some smooth distribution, still can resolve Yes/No if the tail gets squashed to be much lower magnitude.

If I end up not thinking there are outlier channels, resolves N/A.

Related questions