How difficult will Anthropic say the AI alignment problem is?
Optimistic scenario: 11%
Intermediate scenario: 56%
Pessimistic scenario: 4%
No such statement before 2030: 29%

In a blog post entitled “Core Views on AI Safety: When, Why, What, and How” (https://www.anthropic.com/index/core-views-on-ai-safety), Anthropic outlines three possible scenarios for the difficulty of the alignment problem:

“Optimistic scenarios: There is very little chance of catastrophic risk from advanced AI as a result of safety failures. Safety techniques that have already been developed, such as reinforcement learning from human feedback (RLHF) and Constitutional AI (CAI), are already largely sufficient for alignment. The main risks from AI are extrapolations of issues faced today, such as toxicity and intentional misuse, as well as potential harms resulting from things like widespread automation and shifts in international power dynamics - this will require AI labs and third parties such as academia and civil society institutions to conduct significant amounts of research to minimize harms.

Intermediate scenarios: Catastrophic risks are a possible or even plausible outcome of advanced AI development. Counteracting this requires a substantial scientific and engineering effort, but with enough focused work we can achieve it.

Pessimistic scenarios: AI safety is an essentially unsolvable problem – it’s simply an empirical fact that we cannot control or dictate values to a system that’s broadly more intellectually capable than ourselves – and so we must not develop or deploy very advanced AI systems. It's worth noting that the most pessimistic scenarios might look like optimistic scenarios up until very powerful AI systems are created. Taking pessimistic scenarios seriously requires humility and caution in evaluating evidence that systems are safe.”

If Anthropic publicly releases a blog post, announcement, or other official statement detailing a confident best guess at the difficulty of alignment, which of the above three scenarios will be closest to their assessment? This question asks about the first such statement that Anthropic makes, even if they later change their minds. Public statements by individual employees do not count unless the employee is officially acting as a spokesperson for the company. Statements that Anthropic makes jointly with other companies can count. If no such statement is made before Jan 1, 2030, or Anthropic ceases to exist as an entity before releasing such a statement, this market resolves "No such statement before 2030."

A statement detailing the difficulty of alignment can resolve this market even if it doesn’t give numerical probabilities, as long as it expresses roughly the equivalent of >80% confidence in one of these three scenarios. The statement also does not need to specifically use or reference the optimistic/intermediate/pessimistic taxonomy of the above blog post, as long as it confidently advances a picture of the difficulty of alignment that corresponds closely to one of the scenarios. Because this may be a subjective determination, I will not bet in this market.

Finally, this question focuses mostly on the technical side of the alignment problem. For example, if Anthropic announces that relatively naive safety techniques seem sufficient for alignment, but still expresses concern about catastrophic risk from misuse or from other actors not implementing safety measures, this question resolves as the optimistic scenario.
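To make the resolution rules above a bit more concrete, here is a rough Python sketch of how I think about applying them. The names, fields, and the 80% threshold encoding are illustrative only and are not part of any official resolution tooling; the actual resolution will be a judgment call as described above.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of the resolution rules described in this question.
# All names here are hypothetical; nothing below is official tooling.

SCENARIOS = ("Optimistic scenario", "Intermediate scenario", "Pessimistic scenario")

@dataclass
class Statement:
    is_official: bool          # made by Anthropic (alone or jointly), not an individual employee
    year: int                  # year the statement was published
    best_match: Optional[str]  # scenario the statement most closely corresponds to, if any
    confidence: float          # roughly expressed confidence in that scenario (0.0 to 1.0)

def resolve(statements: list[Statement]) -> str:
    """Return the outcome implied by the first qualifying statement made before 2030, if any."""
    for s in sorted(statements, key=lambda s: s.year):
        if (
            s.is_official
            and s.year < 2030              # before Jan 1, 2030
            and s.best_match in SCENARIOS
            and s.confidence > 0.8         # roughly the equivalent of >80% confidence
        ):
            return s.best_match            # first qualifying statement counts, even if later revised
    return "No such statement before 2030"
```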

