Will OpenAI + an AI alignment organization announce a major breakthrough in AI alignment? (2024)

This market predicts whether OpenAI, in collaboration with at least one of the listed AI alignment organizations, will announce a major breakthrough in AI alignment or safety techniques by December 31, 2024.

Resolves YES if:

  • OpenAI and at least one of the listed AI alignment organizations officially announce or confirm a major breakthrough in AI alignment or safety techniques on or before December 31, 2024.

Resolves 50% if:

  • The best candidate research announcement by OpenAI is heavily contested in this market as a breakthrough, any polls conducted in this market to decide are also disputed, and a third-party judge decides it's too uncertain to call YES or NO.

Resolves NO if:

  • No such announcement is made by December 31, 2024.

Resolves as NA if:

  • OpenAI ceases to exist, or the listed AI alignment organizations merge, dissolve, or undergo significant restructuring, rendering the original intent of the market unclear or irrelevant.


  • Major breakthrough refers to a significant, novel, and publicly disclosed discovery, development, or advance in AI alignment or safety techniques that demonstrates measurable progress towards addressing AI alignment or safety concerns. The market creator will assess the candidates for a breakthrough based on OpenAI's Research Index (or submissions posted in the comments) using their own judgment, citation counts, comments from AI alignment influencers, media coverage, and product usage. If the judgment is disputed within one week, a poll will be conducted to determine if the paper in question is a "major breakthrough". If the polls appear to be manipulated, an unbiased alignment researcher will make the final decision.

  • AI alignment organizations refers to Anthropic, Redwood Research, Alignment Research Center, Center for Human Compatible AI, Machine Intelligence Research Institute, or Conjecture. Additional examples may be added.

  • Description can be freely adjusted within one week after market creation. After that, any disputes will be resolved via poll.

bought Ṁ100 of YES

The question is not whether it is a major breakthrough but whether they announce a major breakthrough. For that reason I'm all in. Let me explain.

Many people in scattered corners of the Internet know I believe AI alignment is fundamentally impossible, for the same reason that a randomly selected number from [0,1] will be irrational with 100% probability (that is to say, almost all reals are irrational - almost all utility functions are misaligned). My belief since 2021 has been that the only way to avoid AI apocalypse is to enact a freeze on AI research, and I sketch an idea on how to do so using fine-insured bounties on firms and individuals attempting to pursue it at https://andrew-quinn.me/ai-bounties/. A first-order pass, therefore, suggests I should bet No, because no such major breakthroughs exist.

However! This belies the fact that several papers touted as major breakthroughs have been published in our lifetimes. Not to mention the scrutiny OpenAI operates under increases every day as the left wing and right wing alike in the United States begin to rationally fear what will happen to them in a world without work, if they are also living in a country without an already existing and generous safety net. I say this without political charge one way or another, mind you - I'm trying to model what the median (_not_ the mean) US citizen would think of OpenAI.

Now Bueno de Mesquita's landmark The Logic of Political Survival predicts quite convincingly to me that OpenAI will come under even more extreme scrutiny in 2024 given the disproportionate impact of their achievement (WorldCoin notwithstanding - unless those 25 tokens begin trading at or above the price of Bitcoin's peak within the next year, which is an upset I don't put high credence on, but it's Sam Altman's most logical play under selectorate theory anyway). As a result of that extreme scrutiny, resources dedicated to writing papers on and finding "major" breakthroughs in AI alignment will come fast and hard, and faster and harder.

There are also the more general factors that suggest an exponentially increasing rate of scientific output in the years to come. OpenAI researchers already use their own AI tools to brainstorm and do research on how to achieve AI alignment - which morphs the essential problem into an even more tractable one: "do we think a sufficiently advanced AI model with human augmentation can publish a paper which appears to be a major breakthrough in AI alignment by 2024?"

For all these reasons and more, frankly, I'd be much surprised if OpenAI didn't publish anything. And I would be moderately surprised if they tried to publish it entirely by themselves, given the naked conflict of interest in proving one's own technology safe. They will most likely attempt to publish in conjunction with an org that already has a stated goal of alignment to bolster their case both in the court of public opinion and in any actual court that they might find themselves in.

I see all lights pointing to Yes on this question. God help us all. And to any OpenAI folks who read this - nothing personal, and I obviously hope I'm wrong.

bought Ṁ50 NO

another 50 marbles on "lol you fucking wish", bookie!

Resolving N/A if OpenAI evaporates seems weird. Why not just resolve NO in that case?

@jskf NO mixes "their research failed" with "their business failed". There can be a separate market on the business. The only downside to splitting out markets like that is it's less capital efficient.

bought Ṁ100 of NO

I think I'm still uncertain what counts as a major breakthrough here

  • like, RLHF seems like definitely not a major breakthrough in my view of the field: it has various cool/interesting parts but isn't actually doing anything major?

  • Ex: I'd consider GPT-3 to be a major breakthrough for natural language compared to the previous entries (probably, I'm less familiar with how the field was at that time)

Would polytopes or [Circuits](https://distill.pub/2020/circuits/) count if they had been done by an OpenAI + some organization today?

I could maybe see Circuits counting (in the counterfactual where it was done today), though I'm still uncertain about that? Your criteria for a major breakthrough just seem to be coverage, which seems like a poor proxy (it could still be a really interesting paper about alignment without being a major breakthrough!)

Though even for weak definitions of breakthrough, 75% seemed too high to me, so betting NO.

predicts YES

@Aleph Unfortunately, Conjecture is no longer doing work on the polytope lens. Although nice, it doesn't seem to me like a breakthrough. Also, the paper did not talk about attention-based models.

The circuits approach would definitely be a promising candidate if it had happened within the market's timeframe; so much additional work has been built on top of it.

@Aleph For circuits:

No individual circuits work would likely qualify, but I understand it's iterative, starting from Google's DeepDream (and inspired by even earlier techniques). So it's possible that there's enough cumulative work for it to be collectively a breakthrough.

The main thing I would look for is: "How is the alignment research being used to develop new models? How much better are those models than models not using the technique?"

For example, here: Softmax Linear Units (transformer-circuits.pub)

> Specifically, we replace the activation function with a softmax linear unit (which we term SoLU) and show that this significantly increases the fraction of neurons in the MLP layers which seem to correspond to readily human-understandable concepts, phrases, or categories on quick investigation, as measured by randomized and blinded experiments.

This by itself is much too small to count, but if Anthropic's models get their competitive advantage from all such related circuits work they've been doing, that would be an argument that the circuits work was a breakthrough. Anthropic is picked out because their marketing is about having aligned models they can sell to businesses to do e.g. customer service, where a rude customer triggering the model into saying something bad would be a reputational risk or lead to lawsuits.

If the research isn't influencing the development of any models, I wouldn't count it even if it seems interesting, since the models were equally aligned without this work. So it doesn't have to be a theoretical breakthrough.

If OpenAI + Anthropic were to build on circuits work and this was used to develop a model with a significantly lower hallucination rate, I would likely count it since it's one of the top complaints about language models currently.

Some novelty is required: If OpenAI were to release a model with a lower hallucination rate, but it required no new techniques and just scaled up what they already did to create GPT-4, then it wouldn't be a breakthrough since the techniques are already known.

In short: What's the impact of the research on some measurable alignment problem? Where "impact" is approximately "difference from the second best or best alternative".

bought Ṁ100 of YES

@Mira "significantly increases the fraction of neurons in the MLP layers which seem to correspond to readily human-understandable concepts": note that there is also an increase in the number of neurons that were previously somewhat interpretable but no longer are with SoLU. Also, the models performed worse with SoLU until they added stuff back in to let superposition happen at a later stage.

Close is currently 1 year before deadline. Is this on purpose?

@harfe I think it isn't; look at the complete date.

predicts NO

@dionisos it was changed after my comment.

bought Ṁ2 of NO

They seek Monopoly,
while you indulge in fantasy,
alignment is just a con,
AI will never be your pawn.

@Mason Disturbingly biased response. Is this really a bot account? Are you biasing the bot with e/acc prompting?

@Mason round

To be clear: “breakthrough” just means novel + helpful technique? Would e.g. Constitutional AI have counted in retrospect? Would new prosaic alignment techniques count?

@DeltaTeePrime OpenAI's Research Index should determine the set of candidates. I count 5 "Safety & Alignment" articles in the index in 2022. If this is missing some, people can suggest e.g. arXiv links too. So this will be a relatively small finite set.

I'll look at any products released using it, citation counts from non-OpenAI alignment researchers, and comments made by AI alignment influencers (i.e. if Eliezer Yudkowsky calls it a breakthrough, I would resolve YES), and give an initial judgment. Anyone in the market can dispute the status. If it's not disputed within, say, 1 week, it stands.

If there's a dispute, I'll use a poll for each candidate and ping the AI Alignment group to vote on whether each is a "major breakthrough". I expect they're generally honest.

If it looks like the polls are manipulated (last-minute votes by a coordinated group, etc.) and people in the market raise this as a concern, I'll privately message some alignment researcher on Manifold who does not have a position in this market (or ask them to sell out their position) and ask them to do any final tiebreaking.

So, if you expect that language models everywhere will start using Constitutional AI, that researchers in the field would identify it as a new standard tool, and that it was not an obvious approach that people would've used anyways (or that it is unexpectedly successful, even if it seemed obvious), then you should buy YES.

Or, if you think that only Anthropic will use Constitutional AI, that nobody will cite the papers, that alignment researchers won't have heard of it or think it's trivial, then you should buy NO.
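
As a rough illustration of the procedure described in this comment (initial judgment, a roughly one-week dispute window, a poll, then a neutral tiebreak if the poll looks manipulated), here is a minimal Python sketch. Every name in it (`Candidate`, `Verdict`, `judge`, and the field names) is hypothetical and made up for illustration; nothing here is a Manifold feature, and the 50% resolution path from the description is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    YES = "YES"
    NO = "NO"


@dataclass
class Candidate:
    """One candidate announcement (hypothetical structure, for illustration only)."""
    title: str
    initial_judgment: Verdict                 # the market creator's first call
    disputed_within_week: bool = False        # did any trader dispute that call in time?
    poll_result: Optional[Verdict] = None     # outcome of the poll, if one was run
    poll_looks_manipulated: bool = False      # e.g. coordinated last-minute votes
    tiebreak: Optional[Verdict] = None        # call by a researcher with no position


def judge(c: Candidate) -> Verdict:
    """Walk the escalation steps in order and return the verdict that stands."""
    # Step 1: an undisputed initial judgment stands.
    if not c.disputed_within_week:
        return c.initial_judgment
    # Step 2: a dispute sends the candidate to a poll.
    if c.poll_result is None:
        raise ValueError("a disputed candidate needs a poll result")
    if not c.poll_looks_manipulated:
        return c.poll_result
    # Step 3: a suspicious poll is overridden by the neutral tiebreaker.
    if c.tiebreak is None:
        raise ValueError("a manipulated poll needs a tiebreak decision")
    return c.tiebreak
```

For example, `judge(Candidate("Some paper", Verdict.NO))` returns the initial `Verdict.NO` because nobody disputed it, while a disputed candidate with a clean poll returns whatever the poll said.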

@Mira It's already been months since Constitutional AI was released. Could you please just provide a yes or no answer to the above question? If you cannot provide it, I think it means that despite the verbosity of the description, you don't really know what does and doesn't count, and therefore I wouldn't be comfortable betting in this market.

@jonsimon Assuming the market was made earlier so Constitutional AI would qualify, I estimate my initial judgment would be:

The first part, where they have the AI critique its own outputs given a set of rules, is too obvious to be a breakthrough. I was doing that before I read their paper. So this would only be a candidate if it was unexpectedly effective (an example of that might be "LLMs self-improve their own helpfulness until they are better than most human raters" or "This can be applied at a very early stage in pretraining, and it's somehow effective even when the outputs look nonsensical": even if the technique is obvious, these outcomes wouldn't be).

The second part, where they train a preference model, seems less obvious but has RLHF as a precedent. So that also doesn't seem novel compared to what came before.

I would have to think more about whether RLHF itself would count: Humans do already reroll outputs until they get one they like, which is expressing a preference model; so learning it might also be an obvious next step, given the existence of "universal function approximators". But it's far more efficient to train a reusable model than to sample humans every time, so I can't contest the practicality/cost-efficiency. If people were holding back on deploying LLMs for many years until RLHF started being used to make them safe enough, that would be a point in favor; but it was first studied in 2017 and it didn't get much use until recently. I think the line "Our algorithm needed 900 bits of feedback from a human evaluator to learn to backflip—a seemingly simple task which is simple to judge but challenging to specify." might qualify as "surprisingly effective"? And by now, it's a standard tool in so many language models. I think there are enough non-obvious bits and enough effectiveness that RLHF might qualify.

So I would likely reject Constitutional AI as a "major breakthrough". It seems more a good implementation that gives good cost efficiency, than a research or scientific result.

If people disagreed, I would read counterarguments, and if I still disagreed, move to the poll step.

One problem is RLHF has had 6 years to play out, and I'm only giving 2 years (if it were released today), plus an extra 6 months. So it might not be enough time to judge. On the other hand, large capital investments in LLMs might require the judging be done faster. If there's a borderline case, I might have a poll on leaving the market open longer to evaluate. The candidate must be released before end-of-2024, but judging can be extended arbitrarily past that.

@Mira Thank you for the clear response. I would agree with not considering it a major breakthrough due to an insufficient level of novelty/profundity and impact.

In that case I'll be betting No due to the short time horizon.

bought Ṁ5 of NO

Libs can kill themselves instead of looking for such a complicated bureaucratic method.
