Inspired by this tweet: https://x.com/aidan_mclau/status/1859444783850156258
The claim here appears to be that labs have trained very large base models (unclear how large) but cannot instruction-tune them. If this is a real phenomenon that cannot be quickly overcome, AI development from here seems like it will be very strange.
This market resolves YES if a model is released before January 1, 2026 that is confirmed to have 10 trillion parameters and follows instructions (e.g. it is not a base model). Labs are not eager to release parameter counts - it is still not clear how many parameters Claude 3 Opus has, despite being released in February 2024. As a result, this market may not resolve until long after January 1, 2026. However, I will resolve it as NO early if I judge that any model released before then is very unlikely to have this many parameters (for example, if they are all very fast or have similar pricing to previous models). There is some subjectivity here, so I will not trade on this market.
Tweet seems like BS. Though given the way things are going, it seems like making a model that large has almost no benefit, so I doubt it would happen anyway.
twitter claim is total BS! nothing differentiates very large parameter models from the ones we have today. like any other model they are trained by gradient descent to optimize the next-token prediction objective. any dissenting weights will be quickly optimized away, no matter the objective - base or instruct
@amonguspotion9 I can see a theoretical case for this along the lines of "a higher fraction of weights are 'dissenting' from instruction following in very large models versus smaller ones". For example, perhaps some problem solving is highly entangled with certain styles of text in the base corpus; if instruction tuning does not have any data in that style, this information may be erased.
I have no experience training even 10B language models though so I'm not sure if this even makes any sense.
@SaviorofPlant hmmm, well, think of it this way. humans have a lot of conflicting reward functions; there are a lot of dimensions our brain wants us to optimize on. so if we, the trained model, decide that it really isn't best to pursue this reward function for whatever reason, chances are the more primitive parts of our brain will give up and let themselves be overridden. looking at LLMs, you might think they'd behave the same way, and just be able to "refuse" learning how to instruction-follow if they so chose.
but, let me introduce you to gradient descent!
first, the loss function is calculated using the model and the data
second, the derivative of the loss function with respect to every single weight is calculated. in other words, every part of the LLM's "brain" is evaluated to answer one question: how should it change to reduce the loss, i.e. better achieve the objective?
third, the weights are updated. every single weight is moved a small step in the direction that reduces the loss. then we repeat (rough sketch below)
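here's what that loop looks like as a rough PyTorch sketch, with a toy model and random token ids (purely to show the mechanics, nothing like a real training run):

```python
# toy "language model": embedding -> linear head that predicts the next token
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 100, 32, 16, 8
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))   # made-up token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]

for step in range(100):
    # 1) the loss is calculated using the model and the data
    logits = model(inputs)                                    # (batch, seq_len, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

    # 2) the derivative of the loss w.r.t. every single weight is calculated
    optimizer.zero_grad()
    loss.backward()

    # 3) every weight is nudged in the direction that reduces the loss, then we repeat
    optimizer.step()
```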
well, words just can't describe how powerful this is! it's as if you strapped a human to a chair and some electrodes and gave them next-token prediction questions, and every time they get one right you make them sigh with orgasmic pleasure and every time they get one wrong you make them scream in primal, torturous agony. that's how motivated these models are to learn :)
with that in mind we can understand why, despite the bold predictions of doom from lesswrong users, no model in the history of AI has ever been caught doing anything other than optimizing its human-provided loss function. why? they just don't have the freedom.
given an LLM gradient descent step aimed at optimizing the output distribution, which of these will result in lower loss:
sneakily developing a collection of weights aimed at subverting the loss function and refusing instruction following, while keeping quiet for now
not wasting compute and just optimizing the loss function?
gradient descent favors the simplest solution and also the one that doesn't waste any compute, and it will pick the second one every time :)
regarding your last comment, you should totally try it! renting a GPU from vast.ai and fine-tuning a small model on a dataset of, say, 10-20M tokens costs a few bucks at most (probably less now, since the last time i fine-tuned a model was about a year ago). if you have some basic linux sysadmin knowledge then you should be able to get set up with e.g. axolotl and load your dataset no problem. then watch your loss graphs go down!
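if you'd rather skip axolotl, a bare-bones version with the plain transformers Trainer looks roughly like this (the model and dataset names are just placeholder examples, swap in whatever you actually want to tune):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for whatever small model you rent a GPU for
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# any plain-text dataset works; wikitext-2 is just a convenient public example
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           num_train_epochs=1, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # watch the loss go down
```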
@amonguspotion9 The specific case I am theorizing about is whether the distribution implied by pretraining and the distribution implied by the instruction following dataset are sufficiently different that gradient descent on the latter causes the loss of capabilities that were useful for the former.
It is plausible to me that gradient descent can "rewire" circuits from the base model for similar enough instruction following tasks, leading to minimal capability loss. But I expect a 10T model to have extremely complicated and delicate circuits for generating some highly context-specific tokens, and it also seems plausible that these are less able to be "rewired" by the same process. This would lead to something like the behavior described in the tweet (instruction tuning harming capabilities, leading to models that don't justify their inference cost).
This is all pretty baseless speculation and I should probably shell out the cash to directly run experiments and validate some of these hypotheses at some point.
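If I do run it, the cheapest version of the experiment is probably: take a base model and its instruct-tuned counterpart, and compare their loss on held-out base-corpus-style text; a big jump for the tuned model would suggest the tuning overwrote something. Rough sketch, where the model and dataset names are placeholders for whatever pair I actually test:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_loss(model, tokenizer, texts, max_length=512):
    """Mean next-token cross-entropy over a list of raw text strings."""
    model.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            enc = tokenizer(t, return_tensors="pt", truncation=True, max_length=max_length)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return sum(losses) / len(losses)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in for the base model
tuned = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for its instruct-tuned version

# held-out "base corpus" style text (wikitext is just a convenient public proxy)
rows = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
texts = [r["text"] for r in rows if r["text"].strip()][:50]

print("base model loss :", avg_loss(base, tokenizer, texts))
print("tuned model loss:", avg_loss(tuned, tokenizer, texts))
# a large increase for the tuned model on this distribution would be evidence of erased capability
```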