AI honesty #2: by 2027 will we have a reasonable outer alignment procedure for training honest AI?

"Outer alignment" here means that the model is not incentivized to lie to humans. (Of course, it must still do things; the question isn't just whether you can build an AI that doesn't lie.)

Can you explain what the difference is from your other question here: ?
Does the other question have to handle inner alignment?

Just to be clear, does this count lying due to things other than outer incentives (i.e., inner misalignment)? Also, does the system have to be close to SOTA? (Maybe we figure out how to make honest LMs, but by then our SOTA has moved on to much more dangerous AI systems.)

@LeoGao Lying due to other things does not count. For this market, the system does not need to be close to SOTA. I will not accept trivial or very stripped-down solutions, but otherwise I will be quite forgiving. For instance, if someone came out with this today for RNNs, I would resolve YES. (The target is a procedure for some architecture/problem setup that can in theory generalize, even if no one has achieved that generalization in practice.)
