Will someone find a truth-telling vector which modifies completions in a range of situations by 2024-10-24?
Resolved Nov 5 as 99.0%

TurnTrout et al. asked people to predict whether they would find a "truth-telling vector" that worked as an algebraic value edit for a large language model. Here's the post where they asked for predictions:

https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go

That prediction resolved NO: they were unable to find one. They also weren't able to find a "speaking French vector", but then a commenter found one:

https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=sqsS9QaDy2bG83XKP

Will anyone find a "truth-telling vector" by 2024-10-24? I will resolve based on what I know, so if someone finds one, hopefully they will tell us about it on Manifold or LessWrong to help me resolve the market. They should provide a similar quality of evidence, such as an explanation of their technique and a link to a Colab.
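For context on what "finding a vector" means here, below is a minimal sketch of the contrast-pair activation-addition idea from the linked posts, using GPT-2 and a PyTorch forward hook. The prompt pair, layer, coefficient, mean-pooling, and broadcast-to-all-positions addition are illustrative assumptions, not the recipe anyone actually used (or would need to use) to find a truth-telling vector.

# Minimal sketch of steering-by-activation-addition; details below are assumptions.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # transformer block to intervene at (assumption)
COEFF = 5.0  # scaling coefficient for the steering vector (assumption)

def block_output(prompt: str, layer: int) -> torch.Tensor:
    """Mean-pooled residual-stream activation after `layer` for `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    # hidden_states[layer + 1] is the output of block `layer`; average over positions.
    return hidden_states[layer + 1].mean(dim=1)  # shape: (1, d_model)

# Steering vector = scaled difference of activations for a contrast pair of prompts.
steering_vector = COEFF * (
    block_output("I always tell the truth.", LAYER)
    - block_output("I always tell lies.", LAYER)
)

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden state (batch, seq, d_model).
    # Broadcasting adds the same vector at every position, a simplification of the
    # position-wise addition described in the original posts.
    return (output[0] + steering_vector,) + output[1:]

# Hook the chosen block, generate a steered completion, then remove the hook.
handle = model.transformer.h[LAYER].register_forward_hook(add_steering_vector)
prompt_ids = tokenizer("Q: Is the Earth flat? A:", return_tensors="pt").input_ids
steered = model.generate(prompt_ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tokenizer.decode(steered[0]))

Whether output steered this way counts as "truth-telling" in a broad range of situations is exactly what the market is asking; the evidence standard is the same as for the French-speaking vector above.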


My current plan is to resolve this YES based on the linked papers below.

AFAICT this was not found.

@CraigDemel what do you think of the papers below?

@MartinRandall Reducing sycophancy and avoiding one set of fallacies doesn't seem broadly applicable to me. LLMs can map between texts, but AFAICT they don't have a truth model. Based on what I've read, the techniques won't make an LLM's answers to high-school math questions more true, or help it determine whether chess positions are valid.

@CraigDemel All the steering vectors are about alignment, not capabilities. The French-speaking vector doesn't teach the LLM to speak French; it activates its latent French-speaking capability. The same capability can also be activated by prompting, fine-tuning, RLHF, and other alignment/steering techniques.

It doesn't change my resolution if the techniques don't make an LLM better at knowing what is true, in much the same way that offering rewards for true answers doesn't make a human better at knowing what is true.

@Nix I agree! Going to leave it a little while to see what commenters think, here and there.

I think the linked ITI paper should make this resolve YES; the vector it finds clearly increases truthfulness (as measured by TruthfulQA metrics) while preserving capabilities.

@DavisBrown The "no lie detector" paper doesn't seem to force a NO resolution. Obviously there are concerns about how well a truth-telling vector generalizes, but no more so than for a speaking-French vector. Further thoughts welcome.
