Will a >100 karma LW post claim "How to Catch a Liar" suffers from similar problems to CCS? | Manifold

5

130Ṁ144

2032

31% chance

A recent post by the DeepMind alignment team argues that Contrast-Consistent Search struggles to find a feature that represents "knowledge" among many possible proxy features in a model.

"How to Catch an AI Liar" uses black-box methods to try to tell whether a model is lying. I want to know whether it suffers from problems similar to those of CCS.

As a proxy, I'm using whether a LessWrong post with more than 100 karma makes this claim. The post does not have to be solely about this claim, but it should be materially about it. For example, a general criticism of several methods with a section on this would count. A popular post with an unrelated postscript making this claim wouldn't count.

People are also trading

Will "Catching AIs red-handed" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "My Clients, The Liars" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "Sycophancy to subterfuge: Investigating rewar..." make the top fifty posts in LessWrong's 2024 Annual Review?

Will "Value Claims (In Particular) Are Usually Bullshit" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "Takes on "Alignment Faking in Large Language ..." make the top fifty posts in LessWrong's 2024 Annual Review?

Will "the case for CoT unfaithfulness is overstated" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "Simple probes can catch sleeper agents" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "0. CAST: Corrigibility as Singular Target" make the top fifty posts in LessWrong's 2024 Annual Review?

Will "Detecting Strategic Deception Using Linear Probes" make the top fifty posts in LessWrong's 2025 Annual Review?

Will "Reducing LLM deception at scale with self-oth..." make the top fifty posts in LessWrong's 2025 Annual Review?
