Will OpenAI's Superalignment project produce a significant breakthrough in alignment research before 2027?
42% chance

A team at OpenAI is working to solve the alignment problem. Short of asking whether they will succeed altogether, this question gauges whether it will be publicly known before Jan 1, 2027 that OpenAI has made a significant breakthrough in the alignment problem. The technical details of the breakthrough do not need to be public as long as OpenAI officially announces it and provides evidence, such as a live demonstration or system card, showing what they've achieved.

The resolution criterion for "significant breakthrough" is subjective, so I will not bet on this question. I am looking for breakthroughs roughly as significant for alignment as the Transformer was for deep learning. Here are some example breakthroughs that I think would qualify:

  • Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")

  • During training of a large RL model, robustly predict, using model weights alone, whether and how goal misgeneralization will occur on examples far outside the training distribution

  • Solve polysemanticity

  • Detect and demonstrate deceptive alignment in a language model and identify the circumstances under which it develops during training

  • Introduce a new model architecture that has significant empirical or theoretical advantages over Transformers with respect to alignment in particular, without significantly improving on its capabilities

  • Something I haven't mentioned, on an "I know it when I see it" basis. I'm open to community discussion on what qualifies.

If the team dissolves or significantly reorganizes before announcing such a breakthrough, this question resolves NO.

bought Ṁ27 of NO

From the title, I would bet YES on this, but "roughly as significant for alignment as the Transformer was for DL" is a very high bar, given that all of the LLMs like ChatGPT have been Transformers without comparable advances since (unless perhaps scaling is considered a breakthrough). I expect the Superalignment project to have at least one advance that they report as being extremely important (e.g., a better way to incorporate human feedback than RLHF/PPO), but nothing nearly that significant.

@Jacy Yeah, it does seem like the criterion here is "an alignment advance far greater than any we've had before," which is a high bar

predicts YES

Is there any example of existing work that you'd have considered a significant breakthrough at the time? (e.g., SoLU, constitutional AI, the IOI circuit, or anything else)

I think “analogous to the Transformer” is a high bar that none of these examples quite meet.

"Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")"

This doesn't seem like a significant breakthrough in terms of "solving alignment," even if the work itself would be impressive

@Feanor It would represent a huge advance in mech interp on large models, which would be pretty relevant, though I'm open to more detailed discussion on why it wouldn't be significant.

bought Ṁ0 of YES

@Khoja It'd be significant in mech interp for sure, but I don't think it would qualify under their stated goals, especially with the requirement that the broader AI safety community agree it's extremely relevant

Imagine telling someone in GOFAI twenty years ago that "figuring out how an AI operating on text adds numbers would be a huge breakthrough"...

The other thing is that I don't think the way GPT adds numbers is going to be particularly surprising. Doing that will teach us more about how to do mechanistic interpretability, but not anything about how GPT-3 "does all of the interesting stuff it does," I think?

bought Ṁ0 of YES

@jacksonpolack Yeah, it will be a good advance in mech interp, but I doubt the alignment community in general will judge it a breakthrough

oh you're fuh

predicts YES

@jacksonpolack Yes, I'm reading The Silmarillion and liked the Feanor chapter a lot

@Mira's market already kinda tracks this, but of course with different timelines

@Feanor Yeah, and this question looks at the Superalignment project in particular
