To what extent will Developmental Cognitive Interpretability be successful [Read Updated Description]

Question

===== Overview =====

A collaborator and I have written a LessWrong post discussing a new AI Safety research agenda, Developmental Cognitive Interpretability, which is aimed at trying to predict how AI systems will generalise from their training to deployment.

But will it be any good, and will working on it be worthwhile? Time to see what Manifold thinks!

I might add new props I'm interested in and I'm open to suggestions on this.

===== Summary of the agenda =====

Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition (such as scheming, fitness-seeking, or aligned motivations), inferring cognition from a behavioural snapshot can be tricky. We introduce the idea of Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cognition results from training pipelines, we can predict agent behaviour resulting from pipelines that have not yet been tested.

===== DCI in practice =====

To give you a sense of what this looks like, here's Figure 1 from our first paper on the topic. I'd recommend reading the LessWrong post or the paper if you'd like more detail.

[image]

===== Disclaimers & Clarifications =====

I will not trade on props with subjective / vibes-based resolution criteria (specifically 1-4). On the other props (specifically 5-9), where there might otherwise be incentive issues, I will only bet YES and will not sell. If we end up in a more-subjective-than-expected grey area for these questions, I will defer resolution to the mods.

Specific resolution criteria clarifications:
1-4) Will be evaluated according to my subjective feeling at the end of their time horizon.
5-9) Will resolve YES on event occurrence, and NO if the time limit for them runs out.
1, 2) Resolves YES if I think that focussing on this agenda was worthwhile, and NO if not. Note that if I pivot away because I think another research direction is more worthwhile, this still resolves according to my future view of whether, knowing how things played out, the time I did spend on the agenda was worth it (e.g., we do one more project that ends up being impactful, but then I pivot to something else).
3, 4) Resolves YES if I think the agenda essentially bore fruit. This excludes the first paper, which has already been released as a pre-print. Note this is somewhat independent of 1: I can imagine worlds where I endorse having worked on it but no useful work was produced (e.g., nothing gets published within a year, it helps me indirectly by up-skilling, it gives me a new perspective from which I pivot to something else, etc.), and worlds where there's useful work but I don't endorse having worked on it (e.g., there was an obvious-in-hindsight alternative that would have been more impactful had I worked on it instead).
4, 5, 6) This is going to be very subjective, but I'm thinking of researchers at ~MATS-mentor level and above. Excludes myself and my collaborator.
7, 8, 9) "Top conference" means ICML, NeurIPS, or ICLR. Note that the first paper does count for this. Also note the paper counts are not mistakes. The paper has to list either myself or my collaborator as an author, and has to be an on-topic, direct product of the research agenda. There are grey areas here, but I think it should be fairly obvious how to resolve these in most possible worlds.

I'm happy to provide more clarifications if needed, so please ask questions before trading if you're worried.

Manifold Markets · Answer

Per the Manifold Markets prediction market, the most likely answers are 8) At least one paper from the agenda gets published at a top conference or has a companion LessWrong post with >100 upvotes within 1 year, followed by 2) I reflectively endorse focussing on it 1 year later and 7) At least one paper from the agenda gets published at a top conference or has a companion LessWrong post with >100 upvotes within 6 months. See the market for live updates (17 traders, as of May 31, 2026).
