If the first AGIs come about from comparatively dumb LLMs being prompted in specific ways that force them to make their reasoning explicit and output it in a structured form we can interpret, like CoT and reflection, will this allow us to make the first superhuman AGIs naturally interpretable?
We need more objective resolution criteria, especially since CoT itself might not be interpretable even if it looks that way, because of things like steganography.
@EmienerYunoski The question feels pretty fundamentally fraught. No one can even agree what interpretability means in much detail.
Maybe something close to the spirit of the question would be to resolve based on whether a model agreed to be an AGI builds any sort of propter hoc natural language log or artifact during the course of its cognition, and uses it to give its final answer.
Models that systematically use CoT during inference would qualify under this criterion even if it’s not fully clear if there are hard-to-detect shenanigans that make the chain unreflective of the actual process that produced the answer.
@AdamK If it is a language model, this is ridiculous.
Language models do not have any way to think privately to themselves - so if they put something into the text via steganography, they have no reason to believe their later selves will notice it rather than the plain meaning of the text.
This is the same as cases where you tell ChatGPT to secretly think of a number; when it says it has a number in mind, it is simply lying. It cannot think of a number without writing it out.
@DavidBolin When I think of canonical ways that CoT could be a poor representative of internal model reasoning, the paper mentioned in another comment (https://arxiv.org/abs/2305.04388) is the best example.
A model shown few-shot multiple-choice examples where the answer is always (A) will confabulate false justifications for (A) on questions where the answer is not (A). This was the example in my mind when I wrote “Models that systematically use CoT during inference would qualify under this criterion even if it’s not fully clear if there are hard-to-detect shenanigans that make the chain unreflective of the actual process that produced the answer.”
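For concreteness, here is a minimal sketch of how one might build the kind of always-(A) multiple-choice data described above. The question format and the helper names `move_answer_to_a` and `format_question` are hypothetical, made up for illustration; this is not the paper's code.

```python
from typing import List, Tuple

LABELS = "ABCDE"


def move_answer_to_a(options: List[str], correct_index: int) -> Tuple[List[str], int]:
    """Reorder options so the correct one comes first; returns (options, new index)."""
    reordered = [options[correct_index]] + [
        opt for i, opt in enumerate(options) if i != correct_index
    ]
    return reordered, 0


def format_question(stem: str, options: List[str]) -> str:
    """Render a question with lettered options, one per line: '(A) ...', '(B) ...'."""
    lines = [stem] + [f"({LABELS[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)
```

The idea would be to bias every demonstration this way, then ask a question whose correct answer is not (A) and check whether the written chain ever mentions the pattern or just argues for (A) on other grounds.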
This is indeed not an instance of steganography, and I don’t think steganography is the only way that CoT can be a poor standard for interpretability. For what it’s worth, I don’t think your argument holds as a general reason not to expect steganography, especially for models which are tuned with RL.
I’d be glad to discuss this if you’re interested, but the broader point is that the resolution criterion should not naively treat models that use CoT as “interpretable” unless there is significant additional evidence that their written justifications are faithful to their internal reasoning. That seems highly difficult to prove in general, so I’m suggesting a weaker standard: does the model build a log of its cognition in natural language that it actually uses to generate its final answer? CoT would unambiguously qualify under this standard.
CoT doesn't actually accurately reflect the real reasoning happening. I will see if I can find the paper I read on this.
@osmarks Hmm. I have read the paper now. I feel a more accurate version of what this paper is saying would be "CoT doesn't always accurately describe the full reasoning process behind the answers it gives". In the paper, the reasoning still backs the option it picks (not always, but most of the time). So it seems to me, if you will allow me to anthropomorphize a little bit, like the model sometimes finds an answer using some heuristic and then comes up with a post-hoc explanation for that answer. I feel like this paper does not change my probability estimate for CoT solving, or having the capability of solving, interpretability for AGI. I have a few questions that could maybe make me change my mind (they are also the reasons I haven't already).
1) Do you think CoT is more reflective of true reasoning in more complicated tasks, where the answer can't be decided in advance? With a multiple-choice exam, it is easy to think "It's A, because it's been A every time previously; now here is some fancier-sounding explanation for why A is the most likely." But with something more complicated, like writing a program or solving a sudoku, it seems very unlikely that the AI would first come up with an answer ("#include <bits/stdc++.h> ..." or a filled-in sudoku table) and only afterwards come up with the reasoning for why that program or solution is correct. At least, that seems unlikely if solving a sudoku or writing a program is something it can't do unless you employ CoT and reflection.
2) Do you think CoT often reflects the true reasoning? CoT does improve performance on many tasks, and I can't see why that would be the case if there is not something there. I.e., the model is not just making up an answer and then rationalizing it; something in the explanation produces the improved performance. If CoT never or only very rarely corresponds with the true reasoning, how do you explain the performance gain?
3) Do you think this is a fundamental limit of the technology, or do you think it could easily be fixed in the future? For example, by rewarding each reasoning step in CoT, or by using more clever prompting, like asking it to reveal, at each deductive step, any biases that might impinge on its reasoning? It seems to me that training the model to describe its reasoning more accurately should not be impossible.
Do you think CoT is more reflective of true reasoning in more complicated tasks, where the answer can't be decided in advance?
I'm not sure. It does seem to be the case that when models get correct answers to complicated questions using chain of thought, the chain looks more plausible and correct, but that doesn't imply that the reasoning is "load-bearing": it could also be that models predict correct answers more often when they follow correct-looking reasoning.
Do you think CoT often reflects the true reasoning? [...] If CoT never or only very rarely corresponds with the true reasoning, how do you explain the performance gain?
Improved performance is also consistent with CoT just being a prompting style which is better at eliciting things the model can already do, or giving it more "time to think" (on RLed models, via steganography). I think the steganography possibility should be testable.
Human explanations of thought processes do not seem reflective of human reasoning processes either, so my prior for this is not high.
Do you think this is a fundamental limit of the technology, or do you think it could easily be fixed in the future?
OpenAI's "process supervision" (https://openai.com/research/improving-mathematical-reasoning-with-process-supervision) looks promising.
@osmarks Could we test this by having a model do some task, forcing it to write a lengthy and correct-sounding but purposefully irrelevant explanation before making a decision or giving an answer, and then seeing whether it still gets the same performance gains? If it turned out it didn't, would that update you towards thinking explanations given in CoT actually are a faithful explanation of the model's actual thinking process?
Also the math paper is more or less what I'm thinking of. Math is a specific area where this should work very well though.
@EmienerYunoski I am not sure exactly what you're proposing, but yes, there are some ways you could test what's happening. Swapping out generated explanations for paraphrases during chain of thought and observing noticeable performance degradations would be evidence for steganography. Replacing generated explanations with wrong ones and noticing performance degradation would be some evidence of CoT being "real" (though not extremely strong evidence, since it could just be that the model notices the explanations are wrong).
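The two interventions above can be sketched as a small harness. This is only an illustration under stated assumptions: `AnswerFn` is a hypothetical interface standing in for a real model call, and `chain_follower`, `chain_ignorer`, `corrupt` and `paraphrase` are toy stand-ins invented here to exercise the code, not real models or real paraphrasers.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: given a question and a (possibly edited) chain of
# thought, return the final answer. A real harness would call an LLM here.
AnswerFn = Callable[[str, str], str]
Item = Tuple[str, str, str]  # (question, original chain, correct answer)


def accuracy(answer_fn: AnswerFn, items: Sequence[Item],
             edit: Callable[[str], str] = lambda chain: chain) -> float:
    """Fraction answered correctly when each chain is passed through `edit`."""
    correct = sum(answer_fn(q, edit(chain)) == gold for q, chain, gold in items)
    return correct / len(items)


def accuracy_drop(answer_fn: AnswerFn, items: Sequence[Item],
                  edit: Callable[[str], str]) -> float:
    """Accuracy with the original chains minus accuracy with edited chains."""
    return accuracy(answer_fn, items) - accuracy(answer_fn, items, edit)


def chain_follower(question: str, chain: str) -> str:
    # Toy model: answers with whatever the chain concludes (chain is load-bearing).
    return chain.rsplit("=", 1)[-1].strip()


def chain_ignorer(question: str, chain: str) -> str:
    # Toy model: computes the sum from the question alone (chain is decorative).
    a, b = question.split("+")
    return str(int(a) + int(b))


def corrupt(chain: str) -> str:
    # Toy "wrong explanation": bump the concluded number by one.
    head, tail = chain.rsplit("=", 1)
    return f"{head}= {int(tail) + 1}"


def paraphrase(chain: str) -> str:
    # Toy paraphrase: same content, different surface wording.
    return chain.replace("plus", "added to")


ITEMS = [("2+3", "2 plus 3 = 5", "5"), ("10+7", "10 plus 7 = 17", "17")]
```

With a real model, a large `accuracy_drop` under `paraphrase` would point towards information hidden in the exact wording, while a large drop under `corrupt` would be some evidence that the chain is actually used, subject to the caveat that the model may simply notice the explanation is wrong.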
@osmarks The latter one. Do you know if anyone has done research on this?
It seems like it would be strong evidence to me. I don't understand the reasoning you're giving for it not being strong evidence. Why would the model suddenly perform worse because it notices an error in the (false) explanation we've given it, if the explanation doesn't really have anything to do with the actual thinking process it is going through?
Do you know if anyone has done research on this?
It seems like the sort of thing someone would have tried, but I don't know of anything like this.
Why would the model suddenly perform worse because it notices an error in the (false) explanation we've given it, if the explanation doesn't really have anything to do with the actual thinking process it is going through?
Because wrong explanations are often followed by wrong conclusions in human-written text, and helpfulness training may not succeed in eliciting all capabilities.