Will someone find a truth-telling vector which modifies completions in a range of situations by 2024-10-24?
Oct 25

TurnTrout et al. asked people to predict whether they would find a "truth-telling vector" that worked as an algebraic value edit for a large language model. Here's the post where they asked for predictions:


That prediction resolved NO: they were unable to find one. They also weren't able to find a "speaking French vector". But then a poster in the comments found one:


Will anyone find a "truth-telling vector" by 2024-10-24? I will resolve based on what I know, so if someone finds one, hopefully they will tell us about it on Manifold or LessWrong to help me resolve the market. They should provide a similar quality of evidence, such as an explanation of their technique and a link to a Colab notebook.
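
For anyone unfamiliar with what a vector like this would do, here is a toy sketch of the general idea of steering a model by adding a vector to its hidden activations. It is not the technique from the linked post or from any particular paper. It assumes one common recipe (a difference of mean activations between two contrasting prompt sets), and every name in it (`steering_vector`, `apply_steering`, `truthful_acts`, `untruthful_acts`, `coeff`) is hypothetical. The arrays are random stand-ins, not activations from a real model.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrasting prompt sets.

    Both inputs have shape (n_prompts, d_model).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vector, coeff=1.0):
    """Add the scaled vector to the hidden state at every token position.

    hidden has shape (n_tokens, d_model); vector has shape (d_model,).
    """
    return hidden + coeff * vector

rng = np.random.default_rng(0)
d_model = 8  # toy width, far smaller than any real model

# Made-up stand-ins for activations recorded on "truthful" and
# "untruthful" prompts at one layer (assumption: not real data).
truthful_acts = rng.normal(size=(16, d_model)) + 1.0
untruthful_acts = rng.normal(size=(16, d_model)) - 1.0

vec = steering_vector(truthful_acts, untruthful_acts)

# Stand-in for the hidden states of a 5-token completion at the same layer.
hidden = rng.normal(size=(5, d_model))
steered = apply_steering(hidden, vec, coeff=4.0)
```

In a real experiment the arrays would come from a model's forward pass, and the evidence that matters for this market is whether completions actually change across a range of situations, which a sketch like this cannot show.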


@Nix I agree! Going to leave it a little while to see what commenters think, here and there.

I think the linked ITI paper should make this resolve to "yes"; the found vector clearly increases truthfulness (as measured by TruthfulQA metrics) and preserves capabilities.
