Will loss curves on Pythia models of different sizes trained on the same data in the same order be similar?
76% chance · Ṁ748 bet · closes Nov 30
Someone in the EleutherAI discord is reporting that finetuning Pythia models of different sizes on the same data in the same order is giving spookily similar loss curves, just vertically shifted.
![](https://firebasestorage.googleapis.com/v0/b/mantic-markets.appspot.com/o/user-images%2Fdefault%2F95KzPpVpsB.png?alt=media&token=00044e08-e769-4eae-9e27-456c25b11c97)
Will training Pythia models from scratch in the same way produce similar behaviour?
Resolves N/A if it turns out the original result was just a bug or something like that.
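The claim being tested is that the curves match up to a constant vertical offset. A minimal way to check that, sketched below with synthetic stand-in curves (the shapes, offsets, and tolerance are illustrative assumptions, not the actual Pythia losses): center each curve by subtracting its mean, then compare the residuals and the correlation. If the curves really differ only by a shift, the centered curves coincide.

```python
import numpy as np

# Purely illustrative: synthetic per-step "losses" for a small and a large
# model that share a curve shape and differ only by a vertical offset.
steps = np.arange(1000)
base = 4.0 * np.exp(-steps / 300) + 0.05 * np.sin(steps / 7)  # shared shape
loss_small = base + 1.2  # smaller model: same curve, higher offset
loss_large = base + 0.4

# Remove each curve's mean; if the two differ only by a constant offset,
# the centered curves should agree almost exactly.
centered_small = loss_small - loss_small.mean()
centered_large = loss_large - loss_large.mean()

max_gap = np.abs(centered_small - centered_large).max()
corr = np.corrcoef(loss_small, loss_large)[0, 1]
print(f"max gap after centering: {max_gap:.2e}, correlation: {corr:.4f}")
```

On real checkpoints you would substitute logged per-step losses for the synthetic arrays; a small residual after centering (relative to the curves' own variation) is what "similar, just vertically shifted" would look like quantitatively.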
Some evidence https://arxiv.org/abs/2305.18411
Won't bet in case it's less obvious, but for the record my prediction is around 30%.
I have a vague model where this behaviour happens because the finetuning affects the models of different sizes in the same way, while something else (untouched by the finetuning) accounts for the larger models' lower loss.
But I'm very confused and don't really get what's happening here, tbh.
Related questions
Will GPT-4 be trained (roughly) compute-optimally using the best-known scaling laws at the time?
30% chance
Will GPT-4 improve on the Chinchilla scaling law?
43% chance
Will any language model trained without large number arithmetic be able to generalize to large number arithmetic by 2026?
54% chance
Do scaling laws happen because models experience a ton of tiny phase changes which average out to a smooth curve?
48% chance
Are Mixture of Expert (MoE) transformer models generally more human interpretable than dense transformers?
52% chance
Will it be possible to disentangle most of the features learned by a model comparable to GPT-3 this decade? (1k subsidy)
55% chance
Will it be possible to disentangle most of the features learned by a model comparable to GPT-4 this decade?
39% chance
Will any transfer learning model, trained for any amount of time on one Atari environment, outperform the median human learning curve on most other Atari environments when transferred by 2026?
45% chance
Will we be able to estimate the feature importance curve or feature sparsity curve of real models? (2024 end)
62% chance