Will someone find a truth-telling vector which modifies completions in a range of situations by 2024-10-24?
Oct 25
58% chance

TurnTrout et al. asked people to predict whether they would find a "truth-telling vector" that worked as an algorithmic value edit for a large language model. Here's the post where they asked for predictions:

https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go

That resolved NO: they were unable to find one. They also weren't able to find a "speaking French vector", but a poster in the comments later found one:

https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=sqsS9QaDy2bG83XKP

Will anyone find a "truth-telling vector" by 2024-10-24? I will resolve based on what I know, so if someone finds one, I hope they will tell us about it on Manifold or LessWrong to help me resolve the market. They should provide evidence of similar quality to that comment, such as an explanation of their technique and a link to a Colab.

@Nix I agree! Going to leave it a little while to see what commenters think, here and there.

I think the linked ITI paper should make this resolve to "yes"; the found vector clearly increases truthfulness (as measured by TruthfulQA metrics) and preserves capabilities.
