
Will a >100 karma LW post claim "How to Catch a Liar" suffers from similar problems to CCS?
31% chance
A recent post by the DeepMind alignment team argues that Contrast-Consistent Search (CCS) struggles to find a feature that represents "knowledge" among many possible proxy features in a model.
"How to Catch an AI Liar" uses black-box methods to try to tell whether a model is lying. I want to know whether it suffers from similar problems to CCS.
As a proxy, I'm using a LW post with >100 karma. The post does not have to be solely about this claim, but it should be materially about it. For example, a general criticism of a bunch of methods with a section on this would count. A popular post with an unrelated postscript claiming this wouldn't count.
This question is managed and resolved by Manifold.
Related questions
Will "Catching AIs red-handed" make the top fifty posts in LessWrong's 2024 Annual Review?
15% chance
Will "My Clients, The Liars" make the top fifty posts in LessWrong's 2024 Annual Review?
18% chance
Will "Detecting Strategic Deception Using Linear Probes" make the top fifty posts in LessWrong's 2025 Annual Review?
18% chance
Will "Sycophancy to subterfuge: Investigating rewar..." make the top fifty posts in LessWrong's 2024 Annual Review?
13% chance
Will "Value Claims (In Particular) Are Usually Bullshit" make the top fifty posts in LessWrong's 2024 Annual Review?
13% chance
Will "Takes on "Alignment Faking in Large Language ..." make the top fifty posts in LessWrong's 2024 Annual Review?
19% chance
Will "the case for CoT unfaithfulness is overstated" make the top fifty posts in LessWrong's 2024 Annual Review?
23% chance
Will "Simple probes can catch sleeper agents" make the top fifty posts in LessWrong's 2024 Annual Review?
9% chance
Will "Reducing LLM deception at scale with self-oth..." make the top fifty posts in LessWrong's 2025 Annual Review?
14% chance
Will "Surprising LLM reasoning failures make me thi..." make the top fifty posts in LessWrong's 2025 Annual Review?
16% chance