Will someone explain to me why modern LLMs are not trained with dropout?
Resolved YES (Jun 17)

I think dropout became a lot less necessary after techniques like BatchNorm and LayerNorm started to get big. They're just all-around better regularization techniques.
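
For concreteness, here's a minimal sketch (PyTorch; module names, dimensions, and the dropout rate are illustrative, not taken from any particular paper) of where LayerNorm and dropout typically sit in a transformer feed-forward sub-block. Setting `p_drop=0.0` is equivalent to training without dropout:

```python
# Illustrative sketch only: shows where LayerNorm and (optional) dropout
# sit in a pre-norm transformer feed-forward sub-block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # normalization, always applied
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p_drop)   # p_drop=0.0 -> no dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pre-LayerNorm residual block; dropout applied to the sub-block output
        h = self.fc2(F.gelu(self.fc1(self.norm(x))))
        return x + self.dropout(h)
```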

@jonsimon GPT-1 uses dropout. GPT-2 and GPT-3 don't mention it. The BLOOM paper explicitly says it doesn't use dropout. Chinchilla doesn't mention dropout or weight decay. PaLM mentions weight decay but not dropout, and Llama likewise mentions weight decay but not dropout.
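
A hedged sketch of the "weight decay but no dropout" pattern described above (hyperparameter values are illustrative and not taken from any of the cited papers): the model's dropout is set to zero and decoupled weight decay in AdamW acts as the regularizer instead.

```python
# Illustrative config only: dropout disabled in the model, regularization
# provided by decoupled weight decay in AdamW.
import torch.nn as nn
from torch.optim import AdamW

model = nn.TransformerEncoderLayer(d_model=768, nhead=12, dropout=0.0)  # dropout off

optimizer = AdamW(
    model.parameters(),
    lr=3e-4,            # illustrative learning rate
    weight_decay=0.1,   # weight decay serves as the regularizer instead of dropout
)
```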

Don't need regularization when you only see each example once?

@NoaNabeshima Empirically degrades performance when you only see each example once?
