Will tailcalled think that the Shard Theory alignment research program has achieved something important by October 20th, 2026?
14% chance

The Shard Theory research program by Team Shard is based on the idea that we should directly study what types of cognition/circuitry are encouraged by reinforcement learning, and reason mechanistically about how reward gives rise to them, rather than follow the classic Outer Alignment/Inner Alignment factorization, which aims to create a reward function that matches human values and then somehow give an AI the goal of maximizing that reward. The hope is that understanding how reward maps to behavior is more tractable than Outer Alignment/Inner Alignment, and that we might therefore be able to solve alignment without solving Outer Alignment/Inner Alignment.

In 4 years, I will evaluate Shard Theory and decide whether there have been any important results since today. I will probably ask some of the alignment researchers I most respect (such as John Wentworth or Steven Byrnes) for advice on the assessment, unless the answer is dead obvious.

About me: I have been following AI and alignment research on and off for years, and I have a reasonable enough mathematical background to evaluate it. I tend to have an informal sense of the viability of various alignment proposals, though that sense may well be wrong.

At the time of making the prediction market, my impression is that the attempt to avoid needing to solve Outer Alignment will utterly fail, essentially because the self-supervised learning algorithms that can be used to indefinitely increase capabilities inherently require something like Outer Alignment to direct them towards the correct goal. However, Shard Theory might still produce insights useful for Inner Alignment, as it could give us a better understanding of how training affects neural networks.

More on Shard Theory:

https://www.lesswrong.com/posts/xqkGmfikqapbJ2YMj/shard-theory-an-overview




Either I don't understand tailcalled or the price is too low. Activation engineering à la GPT-2-XL steering (https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector) seems pretty indicative, and the activation addition approach was directly found via shard theory reasoning.

(Also, to correct the market description -- shard theory is not about behavior per se, but about understanding the reinforcement -> circuitry/cognition mapping)
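The activation-addition approach in the linked post steers a model by taking the difference between its residual-stream activations on two contrasting prompts and adding that difference back into the forward pass during generation. Here is a minimal sketch of that idea, assuming base GPT-2 via Hugging Face transformers; the layer index, steering coefficient, and contrast prompts are illustrative assumptions, not the post's actual settings:

```python
# Rough sketch of activation addition ("ActAdd")-style steering.
# Layer, coefficient, and contrast prompts are illustrative, not the post's settings.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # the post uses GPT-2-XL
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # transformer block whose output we modify (assumed)
COEFF = 5.0  # steering coefficient (assumed)

def residual_after_block(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER, shape (1, seq_len, d_model)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    return hidden_states[LAYER + 1]  # hidden_states[0] is the embedding output

# Steering vector: activation difference between two contrasting prompts,
# truncated to a common token length.
pos, neg = residual_after_block(" Love"), residual_after_block(" Hate")
n = min(pos.shape[1], neg.shape[1])
steer = COEFF * (pos[:, :n, :] - neg[:, :n, :])

applied = {"done": False}  # add the vector only on the initial prompt pass

def add_steering(module, inputs, output):
    """Forward hook on block LAYER: add the steering vector at the front of the prompt."""
    hidden = output[0]
    if not applied["done"] and hidden.shape[1] >= steer.shape[1]:
        hidden[:, : steer.shape[1], :] += steer
        applied["done"] = True
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt_ids = tok("I went up to my friend and said", return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=30, do_sample=True, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The relevant point for this market is that the intervention edits internal activations directly rather than the prompt, which is why it is read as evidence about the reinforcement-to-circuitry picture rather than about behavior alone.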

2y

@AlexT2a57 Fixed description, I think?

2y

@AlexT2a57 Regarding activation engineering, it's a pretty good question what implications it should have for the market price.

I don't feel surprised by the GPT-2 word activation vectors, though that raises the question of why I haven't chosen to study this myself or encouraged others to study it, and why I have been skeptical of shard theory. My answer is that, both before and now, I think it is unlikely to scale well enough that one can build on it. It reminds me a lot of adjusting latent variables in GANs, which never really seemed to take off and has now mostly been replaced by prompt engineering, even though I was very excited about it at the time.

But your finding that you can use it to retarget the search in a cheese-solving AI is encouraging. I don't know whether I would have expected this ahead of time, but it is presumably the sort of thing we would see if algebraic value editing can in fact scale. I expect we will see things happening over the next years: one possibility is that people see the usefulness of algebraic value editing and it starts becoming a standard tool; another is that you abandon it as a dead end. (Or, also likely, something murky in between that will require some judgement. I'm willing to comment on proposed scenarios now to help anchor the judgement, but I can't give any final call yet, as there is uncertainty about which scenario we will end up in.)

2y

More plausible than the “alignment nonsense”

If Karpathy, Hotz, Carmack, and probably Demis (though he's too nice to say so) all think this stuff is garbage, it's probably garbage.

Either compute will be limited (stop at 0.x nm or tax flops) or it won’t.

As with Chernobyl or Covid—the person most likely to cause runaway AGI will probably be some half-wit “alignment researcher” themselves.

