Will relaxed adversarial training be used in practice for LLM alignment or auditing before 2028?
69% chance

A number of figures in the alignment community have expressed interest in relaxed adversarial training (RAT) as a technique for model alignment or auditing that may see wider use in the future, much as RLHF eventually became a standard component of large-model alignment. Examples:

Rather than asking whether relaxed adversarial training will become as ubiquitous as RLHF, this question asks whether a technique will exist that uses RAT to measurably improve the safety of the largest models. The question resolves YES if, before 2028, it is publicly known that there exists a technique that:

  • Involves relaxed adversarial training, i.e., targeted perturbations to a model's latent activations, whether used to audit model behavior, as a component of a training objective, or as part of a process to improve the model's out-of-distribution (OOD) robustness (a minimal sketch follows this list)

  • Can scale to models larger than GPT-3

  • Usefully improves some axis of safety, such that at least one AI lab with a market cap over $1 billion is publicly known to implement it for its models. It need not be the best technique of its class, but it has to work and be in use
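
To make the first criterion concrete, below is a minimal PyTorch sketch of one way RAT on latents can work: an attacker searches for a bounded perturbation to a model's hidden activations that maximizes the task loss, and the model is then trained to perform well under that worst-case perturbation. Everything here (the toy `TwoBlockNet` model, `latent_attack`, `rat_step`, and all hyperparameters) is illustrative, not any lab's actual method.

```python
# Minimal sketch of relaxed adversarial training (RAT) on latent activations.
# Hypothetical setup: a toy classifier is split into an encoder and a head so
# hidden activations can be perturbed adversarially (within an epsilon ball)
# to maximize the loss; the model is then trained to stay robust under those
# latent perturbations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBlockNet(nn.Module):
    """Toy model split into an encoder and a head so latents can be perturbed."""
    def __init__(self, d_in=32, d_hidden=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x, latent_delta=None):
        h = self.encoder(x)
        if latent_delta is not None:
            h = h + latent_delta  # inject the adversarial latent perturbation
        return self.head(h)

def latent_attack(model, x, y, eps=0.5, steps=5, lr=0.1):
    """Find a bounded latent perturbation maximizing the loss (PGD in latent space)."""
    with torch.no_grad():
        h = model.encoder(x)
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model.head(h + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)    # stay inside the "relaxed" ball
    return delta.detach()

def rat_step(model, opt, x, y, adv_weight=1.0):
    """One training step: clean loss plus loss under the worst-case latent perturbation."""
    delta = latent_attack(model, x, y)
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x, latent_delta=delta), y)
    loss = clean_loss + adv_weight * adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TwoBlockNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(128, 32)
    y = (x.sum(dim=1) > 0).long()  # synthetic labels for the toy task
    for step in range(50):
        loss = rat_step(model, opt, x, y)
    print(f"final combined loss: {loss:.4f}")
```

A qualifying technique need not use this exact loss; any use of targeted latent perturbations for auditing, training, or OOD robustness would count.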


Reposting in the context of https://arxiv.org/abs/2403.05030
