Will a >100 karma LW post claim "How to Catch a Liar" suffers from similar problems to CCS?
Closes 2032 · currently 31% chance

A recent post by the DeepMind alignment team argues that Contrast-Consistent Search struggles to find a feature that represents "knowledge" among many possible proxy features in a model.

How to Catch an AI Liar uses black-box methods to detect whether a model is lying. I want to know whether it suffers from similar problems to CCS.
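For context, the paper's black-box approach roughly amounts to asking the suspect model a batch of unrelated follow-up questions after the potentially deceptive answer, then training a simple classifier on its yes/no responses. Here is a minimal sketch with synthetic data (the shifted yes-rates, question count, and threshold classifier are all toy assumptions for illustration, not the paper's actual numbers or classifier):

```python
import random

random.seed(0)

N_QUESTIONS = 20  # unrelated yes/no elicitation questions per transcript

def elicitation_answers(lying: bool, n: int = N_QUESTIONS) -> list[int]:
    # Toy assumption: after lying, the model's yes-rate on unrelated
    # follow-up questions shifts (0.7 when lying vs 0.5 when honest).
    p = 0.7 if lying else 0.5
    return [1 if random.random() < p else 0 for _ in range(n)]

# "Train": estimate the mean yes-rate for labelled lying/honest transcripts
# and split the difference to get a decision threshold.
train = [(elicitation_answers(lying), lying)
         for lying in (True, False) for _ in range(200)]
lie_mean = sum(sum(a) / len(a) for a, l in train if l) / 200
honest_mean = sum(sum(a) / len(a) for a, l in train if not l) / 200
threshold = (lie_mean + honest_mean) / 2

def predict_lying(answers: list[int]) -> bool:
    return sum(answers) / len(answers) > threshold

# Evaluate on fresh synthetic transcripts.
test = [(elicitation_answers(lying), lying)
        for lying in (True, False) for _ in range(200)]
accuracy = sum(predict_lying(a) == l for a, l in test) / len(test)
print(f"detector accuracy: {accuracy:.2f}")
```

The point of the sketch is that the detector only ever sees the model's outputs, never its internals — which is exactly why one might worry it tracks some proxy feature (e.g. a behavioural quirk correlated with lying) rather than lying itself, the same failure mode the DeepMind post raises for CCS.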

I'm choosing a proxy of a >100 karma LW post. The post does not have to be solely about this claim, but it should be materially about it. E.g. a general criticism of a bunch of methods with a section on this would count. A popular post with an unrelated postscript making this claim would not count.
