This market resolves YES if it is possible to fully interpret a model with GPT-2-level capabilities, or to create an inherently interpretable model with GPT-2-level capabilities. Anywhere below that I say "GPT-2", I mean it as shorthand for the above, rather than literally the specific GPT-2 model.
There is no rigorous definition of "fully interpret", so resolution will be based on my opinion. But here are some examples of things we'd be able to do if we had full interpretability:
Being able to come up with a detailed, fully granular explanation (e.g. not about entire attention heads or "super nodes", but about individual channels or features) that can be validated by some strong form of ablation (probably much stronger than zero/mean ablation; there's a baseline ablation sketch further down).
Being able to make very nontrivial and unobvious predictions about the model's behavior in unobserved situations.
Being able to make strong guarantees about ways the model will never behave.
Being able to do this for basically any behavior we care about if we wanted to, rather than only for some narrow subset of behaviors.
There existing a generally agreed sense that we have "pulled back the veil" and there isn't really any more mystery underlying why GPT-2 works.
Given the understanding of how GPT-2 works, it should be possible for a person (given 100 years or whatever) to hand-craft a Python script that is GPT-2 level at language modeling, where the script makes sense to people and can be maintained the same way any normal piece of complex software can be.
It's ok if the actual explanation of GPT-2 is too big to understand in a short amount of time, as long as it's pretty obviously true that you could eventually understand it with enough time (e.g. if there's a huge list of memorized facts and bigram frequencies, it's ok if it would take a human 100 years to understand). In particular, if you could have GPT-5 (or 6) understand the model for you by running for a really long time, that counts.
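To make the "huge list of memorized facts and bigram frequencies" picture concrete, here is a minimal toy sketch of that kind of hand-maintained script (the table contents are made up for illustration; a real GPT-2-level version would be astronomically larger, but each entry would be just as legible and maintainable):

```python
import random

rng = random.Random(0)  # fixed seed so the toy example is reproducible

# Hand-maintained bigram frequency table: previous word -> {next word: count}.
# (Entries here are invented; a hand-crafted GPT-2-level script would contain
# an enormous number of these, plus memorized facts, all human-readable.)
BIGRAM_COUNTS = {
    "the": {"cat": 40, "dog": 35, "market": 25},
    "cat": {"sat": 60, "slept": 40},
    "dog": {"barked": 70, "slept": 30},
}

def next_word(prev):
    """Sample the next word in proportion to its memorized bigram count."""
    counts = BIGRAM_COUNTS.get(prev)
    if counts is None:
        return "<unk>"  # no memorized continuation for this word
    words, weights = zip(*counts.items())
    return rng.choices(words, weights=weights, k=1)[0]

def generate(start, length=5):
    """Extend the prompt word by word by sampling from the bigram table."""
    words = [start]
    for _ in range(length):
        words.append(next_word(words[-1]))
    return " ".join(words)

print(generate("the"))  # prints a short sampled continuation of "the"
```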
I don't intend to be an asshole about the "full" part. If we understand 99% of the model, and there's a tiny bit that we can't understand, and we have no reason to believe that part is actually crucial, then that's fine. If we understand everything up to some amount of noise, I'd want some strong reason why we are sure there is nothing hiding in the noise, but if given that, I'm ok with it.
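And for the ablation point above, here is a minimal sketch of what single-channel zero/mean ablation looks like mechanically, using a toy PyTorch model as a stand-in for GPT-2 (the layer and channel index are arbitrary; as noted above, full interpretability would likely need validation much stronger than this baseline):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a network whose internals we claim to understand.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
x = torch.randn(4, 16)

CHANNEL = 5  # arbitrary channel our hypothetical explanation makes a claim about

def zero_ablate(module, inputs, output):
    """Forward hook: silence one channel of the intermediate activation."""
    out = output.clone()
    out[:, CHANNEL] = 0.0  # zero ablation; mean ablation would instead set it to out[:, CHANNEL].mean()
    return out  # returning a tensor replaces the module's output

baseline = model(x)

# Re-run the model with the hook attached, so the channel is ablated mid-forward-pass.
handle = model[1].register_forward_hook(zero_ablate)
ablated = model(x)
handle.remove()

# If the explanation says this channel carries some specific information, the
# change in the outputs should match what the explanation predicts.
print("mean |change in output| from ablating one channel:",
      (baseline - ablated).abs().mean().item())
```

The same hook-based idea applies to actual transformer activations (attention heads, MLP neurons, learned features), with the effect of the ablation compared against what the proposed explanation predicts.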
Update 2026-01-24 (PST) (AI summary of creator comment): Resolution in edge cases involving AGI/ASI:
If AI assistants help solve interpretability before 2028, this resolves YES
If AGI takes over and murders/subjugates humanity before 2028, then solves interpretability: resolves N/A or NO (since the research agenda failed to be completed in time)
If alignment is solved without interpretability and we reach transhumanist utopia before 2028, with interpretability only solved after takeoff: resolves N/A (since interpretability wouldn't have been relevant to the alignment question)
I really like this question!
And I bet NO, because it seems this is the one model that we understand completely, and it is so very non-intuitive and complicated that I doubt a bigger model can be understandable.
@Bayesian I might wait a little bit for the market to settle before putting too much more in.
Clarification crossposted from LessWrong:
habryka asks:
This is a dumb question but... is this market supposed to resolve positively if a misaligned AI takes over, achieves superintelligence, and then solves the problem for itself (and maybe shares it with some captive humans)? Or any broader extension of that scenario?
My timelines are not that short, but I do currently think basically all of the ways I expect this to resolve positively will very heavily rely on AI assistance, and so various shades of this question feel cruxy to me.
my reply:
I honestly didn't think of that at all when making the market, because I think takeover-capability-level AGI by 2028 is extremely unlikely.
I care about this market insofar as it tells us whether (people believe) this is a good research direction. So obviously it's perfectly ok to resolve YES if it is solved and a lot of the work was done by AI assistants. If AI fooms and murders everyone before 2028, then this is obviously a bad portent for this research agenda, because it means we didn't get it done soon enough, and it's little comfort if the ASI solves interp after murdering or subjugating all of us. So that would resolve N/A, or maybe NO (not that it will matter whether your mana is returned to you after you are dead). If we solve alignment without interpretability, live in the glorious transhumanist utopia before 2028, and only manage to solve interpretability after takeoff, then... idk, I think the best option is to resolve N/A, because we also don't care about that today when deciding whether this is a good agenda.
trying to gauge sentiment on the research agenda I articulated here: https://www.lesswrong.com/posts/Hy6PX43HGgmfiTaKu/an-ambitious-vision-for-interpretability