Will relaxed adversarial training be used in practice for LLM alignment or auditing before 2028?

A number of figures in the alignment community have expressed interest in relaxed adversarial training (RAT) as a technique for model alignment or auditing that may see wider use in the future, much as RLHF eventually became a standard component of large-model alignment. Examples:

Rather than asking whether relaxed adversarial training will become as ubiquitous as RLHF, this question asks whether there will exist a technique that uses RAT to measurably improve the safety of the largest models. The question resolves YES if, before 2028, it is publicly known that there exists a technique that:

  • Involves relaxed adversarial training, i.e. targeted perturbations to a model's latent activations, whether applied when auditing model behavior, as a component of a training objective, or as part of a process to improve the model's out-of-distribution (OOD) robustness

  • Can scale to models larger than GPT-3

  • Usefully improves some axis of safety, such that at least one AI lab valued at over $1 billion is publicly known to implement it for its models. It need not be the best technique in its class, but it has to work and be used
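To make the first criterion concrete, here is a minimal sketch of latent adversarial training, one instance of relaxed adversarial training: an adversary perturbs the model's hidden activations (rather than its inputs — the "relaxation") to maximize the loss, and the model is trained to stay correct under those perturbations. The toy two-layer network, the L2 perturbation budget, and all hyperparameters below are illustrative assumptions, not any lab's published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary classification data (label: x0 + x1 > 0).
X = rng.normal(size=(64, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Two-layer network: latents h = tanh(X @ W1), logit = h @ w2.
W1 = rng.normal(scale=0.5, size=(4, 8))
w2 = rng.normal(scale=0.5, size=8)

def forward(X, W1, w2, delta=None):
    h = np.tanh(X @ W1)          # latent activations
    if delta is not None:
        h = h + delta            # adversarial latent perturbation
    return h, sigmoid(h @ w2)

def latent_adversary(X, y, W1, w2, eps=0.5, steps=5, lr=0.3):
    """Gradient ascent on the latents: find a worst-case perturbation
    inside a per-example L2 ball of radius eps. The adversary acts on
    activations directly, not on realizable inputs."""
    h, _ = forward(X, W1, w2)
    delta = np.zeros_like(h)
    for _ in range(steps):
        p = sigmoid((h + delta) @ w2)
        # d(mean cross-entropy)/d(latents) = (p - y) * w2 / n
        grad = np.outer(p - y, w2) / len(y)
        delta += lr * grad       # ascend the loss
        # Project each row back into the eps-ball.
        norm = np.linalg.norm(delta, axis=1, keepdims=True)
        delta *= np.minimum(1.0, eps / np.maximum(norm, 1e-12))
    return delta

# Outer loop: fit the weights against adversarially perturbed latents.
lr = 0.5
for step in range(300):
    delta = latent_adversary(X, y, W1, w2)
    h, p = forward(X, W1, w2, delta)
    g = (p - y) / len(y)                       # dL/dlogit
    grad_w2 = h.T @ g
    grad_h = np.outer(g, w2)                   # gradient flows through tanh only
    grad_W1 = X.T @ (grad_h * (1 - np.tanh(X @ W1) ** 2))
    w2 -= lr * grad_w2
    W1 -= lr * grad_W1

_, p_clean = forward(X, W1, w2)
clean_acc = np.mean((p_clean > 0.5) == (y > 0.5))
print(f"clean accuracy after latent-adversarial training: {clean_acc:.2f}")
```

The inner loop is the "relaxed" adversary: because it edits latents directly, it can exercise internal states that no realizable input triggers, which is what makes the technique attractive for auditing OOD behavior. A scaled version would replace the manual gradients with autograd and the toy network with a transformer's residual stream.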


Reposting in the context of https://arxiv.org/abs/2403.05030
