Will someone explain to me why modern LLMs are not trained with dropout?
Resolved YES (Jun 17)

I think dropout became a lot less necessary after techniques like BatchNorm and LayerNorm caught on. They're just all-around better regularization techniques.
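
For a concrete picture of what's being discussed, here's a minimal pre-LayerNorm transformer block sketch in PyTorch. The module names and the `dropout_p` argument are illustrative, not taken from any of the papers mentioned in this thread; GPT-1-era models set the dropout probability around 0.1, while most recent LLMs effectively train with it at 0.0 and lean on LayerNorm plus weight decay instead.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-LayerNorm transformer block (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, dropout_p: float = 0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout_p, batch_first=True
        )
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # With dropout_p = 0.0 this is effectively a no-op, which is how
        # most recent LLMs are trained; older models used ~0.1 here.
        self.drop = nn.Dropout(dropout_p)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.mlp(self.ln2(x)))
        return x

# Usage: block = TransformerBlock(d_model=768, n_heads=12, dropout_p=0.0)
```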

@jonsimon GPT-1 uses dropout; GPT-2 and GPT-3 don't mention it. BLOOM explicitly says it doesn't use dropout, Chinchilla mentions neither dropout nor weight decay, PaLM mentions weight decay but not dropout, and LLaMA likewise mentions weight decay but not dropout.

Don't need regularization when you only see each example once?

@NoaNabeshima Or dropout empirically degrades performance when you only see each example once?
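
To put the "each example once" intuition into numbers, here's a toy back-of-the-envelope calculation; the token counts are hypothetical, not from the thread. If the training budget amounts to roughly one pass over the corpus, the model never revisits an example, so the memorization pressure that dropout guards against has little opportunity to develop.

```python
# Toy illustration of the single-epoch pretraining regime (hypothetical numbers).
dataset_tokens = 1.0e12    # unique tokens available in the pretraining corpus
training_tokens = 1.0e12   # tokens actually consumed during training

effective_epochs = training_tokens / dataset_tokens
print(f"effective epochs over the corpus: {effective_epochs:.2f}")
# ~1.0: each example is seen about once, so classic overfitting
# (memorizing repeated examples) barely has a chance to occur --
# the failure mode dropout is meant to regularize against.
```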