A number of figures in the alignment community have expressed interest in relaxed adversarial training as a technique for model alignment or auditing that may see wider use in the future, much as RLHF eventually became a standard component of large-model alignment. Examples:
Rather than asking whether relaxed adversarial training will become as ubiquitous as RLHF, this question asks whether there will exist a technique that uses RAT to measurably improve the safety of the largest models. The question resolves YES if, before 2028, it is publicly known that there exists a technique that:
- Involves relaxed adversarial training, i.e. targeted perturbations to a model's latents, whether used when auditing model behavior, as a component of a training objective, or as part of any process to improve a model's out-of-distribution (OOD) robustness (see the sketch after this list)
- Can scale to models larger than GPT-3
- Usefully improves some axis of safety, such that at least one AI lab with a market cap over $1 billion is publicly known to implement it for their models. It doesn't have to be the best technique of its class, but it has to work and be used.
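For concreteness, the latent-perturbation reading of the first criterion can be written as a training objective. Below is a minimal, hypothetical PyTorch sketch, not any lab's published method: the `latent_adversarial_loss` helper, the `encoder`/`head` split, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def latent_adversarial_loss(encoder, head, x, y, epsilon=0.1, steps=5, step_size=0.02):
    """Worst-case task loss under a small perturbation of the model's latents.

    `encoder` maps inputs to latent activations and `head` maps latents to
    logits; splitting the model this way is an assumption of this sketch.
    """
    # Clean forward pass to get the latents that will be perturbed.
    with torch.no_grad():
        z = encoder(x)

    delta = torch.zeros_like(z)

    # Inner loop: projected gradient ascent on the task loss, optimizing the
    # latent perturbation only (model parameters are not updated here).
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(head(z + delta), y)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = delta + step_size * grad.sign()
            delta = delta.clamp(-epsilon, epsilon)  # project into the L-inf ball

    # Outer objective: train the model to behave well under the worst-case
    # latent perturbation found above (gradients flow through encoder and head).
    z = encoder(x)
    return F.cross_entropy(head(z + delta.detach()), y)
```

In a transformer setting, the `encoder`/`head` split would plausibly be the first N blocks versus the remainder, with the perturbation applied to the residual stream at that layer; the classification-style loss here is only a stand-in for whatever safety-relevant objective the audited behavior uses.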