Will someone explain to me why modern LLMs are not trained with dropout?
Resolved YES (Jun 17)

I think dropout became a lot less necessary after techniques like BatchNorm and LayerNorm caught on. They're just all-around better regularization techniques.
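
For a concrete picture of what's being discussed, here's a minimal pre-LayerNorm transformer block sketch in PyTorch. The module names and the `dropout_p` argument are illustrative, not taken from any of the papers mentioned in this thread; GPT-1-era models set the dropout probability around 0.1, while most recent LLMs effectively train with it at 0.0 and lean on LayerNorm plus weight decay instead.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-LayerNorm transformer block (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, dropout_p: float = 0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout_p, batch_first=True
        )
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # With dropout_p = 0.0 this is effectively a no-op, which is how
        # most recent LLMs are trained; older models used ~0.1 here.
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.mlp(self.ln2(x)))
        return x

# Usage: block = TransformerBlock(d_model=768, n_heads=12, dropout_p=0.0)
```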

@jonsimon GPT-1 uses dropout; GPT-2 and GPT-3 don't mention it. BLOOM explicitly says it doesn't use dropout, Chinchilla mentions neither dropout nor weight decay, PaLM mentions weight decay but not dropout, and LLaMA likewise mentions weight decay but not dropout.

Don't need regularization when you only see each example once?

@NoaNabeshima Or dropout empirically degrades performance when you only see each example once?
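
To put the "each example once" intuition into numbers, here's a toy back-of-the-envelope calculation; the token counts are hypothetical, not from the thread. If the training budget amounts to roughly one pass over the corpus, the model never revisits an example, so the memorization pressure that dropout guards against has little opportunity to develop.

```python
# Toy illustration of the single-epoch pretraining regime (hypothetical numbers).
dataset_tokens = 1.0e12    # unique tokens available in the pretraining corpus
training_tokens = 1.0e12   # tokens actually consumed during training

effective_epochs = training_tokens / dataset_tokens
print(f"effective epochs over the corpus: {effective_epochs:.2f}")
# ~1.0: each example is seen about once, so classic overfitting
# (memorizing repeated examples) barely has a chance to occur --
# the failure mode dropout is meant to regularize against.
```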