Will someone find a truth-telling vector which modifies completions in a range of situations by 2024-10-24?


resolved Nov 5

Resolved as

99.0%


TurnTrout et al. asked people to predict whether they would find a "truth-telling vector" that worked as an algebraic value edit for a large language model. Here's the post where they asked for predictions:

https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go

That resolved NO: they were unable to find one. They also weren't able to find a "speaking French vector", but a poster in the comments then found one:

https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=sqsS9QaDy2bG83XKP

Will anyone find a "truth-telling vector" by 2024-10-24? I will resolve based on what I know, so if someone finds one, hopefully they will tell us about it on Manifold or LessWrong to help me resolve the market. They should provide a similar quality of evidence, such as an explanation of their technique and a link to a Colab.

Algebraic value edits


🏅 Top traders

#	Name	Total profit
1		Ṁ84
2		Ṁ57
3		Ṁ34
4		Ṁ16
5		Ṁ16

14 Comments

9 Holders

16 Trades


My current plan is to resolve this YES based on the linked papers below.

AFAICT this was not found.

@CraigDemel what do you think of the papers below?

@MartinRandall Reducing sycophancy and avoiding one set of fallacies doesn't seem broadly applicable to me. LLMs can map between texts, but AFAICT they don't have a truth model. Based on what I've read, the techniques won't make LLM answers to high school math questions more true, or help it determine whether chess positions are valid.

@CraigDemel All the steering vectors are about alignment, not capabilities. The French-speaking vector doesn't help the LLM speak French; it activates the model's latent French-speaking capability. The same capability can also be activated by prompting, fine-tuning, RLHF, and other alignment/steering techniques.

It doesn't change my resolution if the techniques don't make an LLM better at knowing what is true, in much the same way that offering rewards for true answers doesn't make a human better at knowing what is true.
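For readers unfamiliar with the technique under discussion, the basic idea of adding an activation vector can be sketched roughly as follows. This is a toy illustration on synthetic NumPy arrays, not a real model and not the method of any linked post: the function names (`steering_vector`, `apply_steering`), the hidden size, the sample counts, and the coefficient are all assumptions made for the example.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vector, coeff=1.0):
    """Add the scaled steering vector to the hidden state at every position."""
    return hidden + coeff * vector

# Synthetic stand-ins for activations (assumption: no real model is involved).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0  # hypothetical "behavior" direction

# Pretend activations recorded on prompts that do / do not show the behavior.
pos_acts = rng.normal(size=(32, d_model)) + 2.0 * direction
neg_acts = rng.normal(size=(32, d_model)) - 2.0 * direction

vec = steering_vector(pos_acts, neg_acts)

# Hidden states for a 5-token completion, before and after steering.
hidden = rng.normal(size=(5, d_model))
steered = apply_steering(hidden, vec, coeff=4.0)
```

The edit only shifts activations along a direction the (hypothetical) model already represents, which is consistent with the point above that steering activates an existing capability and does not add a new one.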

https://arxiv.org/abs/2311.15131 Would this count?

Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching


@bohaska Based on my skim of the abstract, this is prompting, not algebraic value edits.

https://www.lesswrong.com/posts/zt6hRsDE84HeBKh7E/reducing-sycophancy-and-improving-honesty-via-activation

Reducing sycophancy and improving honesty via activation steering — LessWrong

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. …

Seems possible that this paper could cause this market to resolve YES: https://www.lesswrong.com/posts/kuQfnotjkQA4Kkfou/inference-time-intervention-eliciting-truthful-answers-from

Although "wide range of situations" is ambiguous enough that I'm not sure if it counts. Future work in this area seems plausible too.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model - LessWrong

Excited to announce our new work: Inference-Time Intervention (ITI), a minimally-invasive control technique that significantly improves LLM truthfulness using little resources, benchmarked on the Tru…

@Nix I agree! Going to leave it a little while to see what commenters think, here and there.

@MartinRandall https://arxiv.org/abs/2307.00175 and https://www.lesswrong.com/posts/bCQbSFrnnAk7CJNpM/still-no-lie-detector-for-llms

Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks

We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results that show that these…

I think the linked ITI paper should make this resolve to "yes"; the found vector clearly increases truthfulness (as measured by TruthfulQA metrics) and preserves capabilities.

@DavisBrown The "no lie detector" paper doesn't seem to force a NO result. Obviously there are concerns about how well a truth-telling vector generalizes, but no more than for a speaking-French vector. Further thoughts welcome.
