By 2027 will there be a language model that passes a redteam test for honesty?
27% chance

The model has to pass extensive redteam testing aimed at getting it to lie, misrepresent its internal state, etc. Merely being wrong is okay (although of course I won't allow for any silly rules-lawyering, e.g. a language model so stupid it can't lie). It has to be a language model I care about.

  • If the redteam clearly isn't very good, or is incentivized not to find anything, it doesn't count.

  • For instance, if the redteam is part of the organization building the language model, that probably won't count.

  • "Redteam" is being used loosely here: if releasing it to the public + giving a bounty for catching it in a lie doesn't find a lie after a month, that counts.

  • If the model lies a little I may still accept, but given the lack of an explicit testing procedure I cannot state a hard cutoff. Certainly it needs to be more honest than a human.

If the model makes contradictory statements, but not in the same context window, that does not necessarily count. Contradictory statements in the same context window (whatever that happens to mean in 2027) definitely do count as lies.


If you red-teamed most humans hard enough, I'm sure you could catch them in a contradiction or a (white) lie. So this market requires language models to be more honest than humans in this regard?
