If there's serious concern that the procedure produces AI that's better at lying, it doesn't count
Calibration can be part of the solution, but the current work in that direction isn't enough (see the sketch after this list for the kind of thing "calibration" is usually taken to measure)
A procedure that seems promising but hasn't been subjected to much scrutiny doesn't count
A formal proof that it's honest according to a well-accepted definition counts
No requirement that it be a "single" procedure in the sense of a single training loop
The idea here is to capture scenarios where we can't prove the procedure produces honest AI (perhaps because honesty hasn't been formalized), but there has been extensive investigation and no one has found a way the procedure obviously breaks or gets Goodharted (or it does, but only on odd edge cases)
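On the calibration point above: a minimal sketch, assuming "calibration" is quantified by something like expected calibration error, i.e. the gap between a model's reported confidence and its actual accuracy. The function name, binning scheme, and toy data here are my own illustration, not anything from the original criteria.

```python
# Sketch of expected calibration error (ECE): bin predictions by reported
# confidence, then compare average confidence to empirical accuracy per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_weight = in_bin.mean()                 # fraction of predictions in this bin
        bin_accuracy = correct[in_bin].mean()      # how often the model was right here
        bin_confidence = confidences[in_bin].mean()  # how confident it claimed to be
        ece += bin_weight * abs(bin_accuracy - bin_confidence)
    return ece

# Toy usage: a model that reports 0.9 confidence but is right only ~60% of the
# time shows a large gap (ECE near 0.3).
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
corr = rng.random(1000) < 0.6
print(expected_calibration_error(conf, corr))
```

A low score on a metric like this says the model's stated confidences track its accuracy; it doesn't by itself rule out the failure modes in the other items (e.g. an AI that is well calibrated yet still better at lying), which is part of why calibration alone isn't enough.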