Note: I expect these ideas to be integrated into a coherent paradigm by 2026; I've extended the market until 2040 only to provide padding. Market resolves at such time as I can have discussions with more academically respected safety nerds and be told that there is a coherent main paradigm in safety, and that it either is or isn't, according to them, well represented by this list of components.
Causal-intervention-based definition of agency: followup papers from the https://causalincentives.com/ group and similar work, especially Discovering Agents and Interpreting Systems as Solving POMDPs
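To gesture at the shape of that test, here's a toy sketch in the spirit of Discovering Agents' intervene-on-the-mechanisms criterion; the one-variable setup and the utility are illustrative assumptions of mine, not the paper's actual algorithm:

```python
import numpy as np

# Toy version of the "intervene on mechanisms, see what adapts" idea:
# a candidate decision node counts as agent-like if its policy changes
# when we intervene on the mechanism of a variable it cares about.

def best_action(outcome_mechanism, actions=np.linspace(-1.0, 1.0, 201)):
    # The candidate agent re-optimizes against whatever mechanism is in place.
    utilities = [outcome_mechanism(a) for a in actions]
    return actions[int(np.argmax(utilities))]

# Original outcome mechanism rewards actions near +0.5 ...
original_mechanism = lambda a: -(a - 0.5) ** 2
# ... and the intervened mechanism rewards actions near -0.5 instead.
intervened_mechanism = lambda a: -(a + 0.5) ** 2

policy_before = best_action(original_mechanism)
policy_after = best_action(intervened_mechanism)

# A rock wouldn't change; an optimizer does. Adaptation under mechanism
# intervention is the (toy) signature of agency here.
print("adapts to mechanism intervention:", not np.isclose(policy_before, policy_after))
```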
Mutual information, or a refined form of a mutual empowerment metric, as a key representation of what friendliness boils down to; eg, Mutual-Information Maximizing Interfaces (MIMI) or a followup as a key component of what it is we want a general AI to do with other agents in the first place. The key insight driving MIMI is that users' commands should become less noisy and more correlated with outcomes as the user becomes satisfied with the results of their actions.
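As a rough illustration of that intuition (the metric, not MIMI's actual method), a plug-in mutual-information estimate between commands and outcomes behaves the way the sentence above describes; the histogram estimator and the toy "interface logs" are assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in MI estimate (in nats) from a 2-D histogram of paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over y
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over x
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

# Hypothetical interface logs: a satisfied user's commands track outcomes closely,
# a frustrated user's commands are mostly noise relative to outcomes.
rng = np.random.default_rng(0)
outcomes = rng.normal(size=5000)
commands_satisfied = outcomes + 0.1 * rng.normal(size=5000)
commands_frustrated = outcomes + 2.0 * rng.normal(size=5000)

print(mutual_information(commands_satisfied, outcomes))   # high MI
print(mutual_information(commands_frustrated, outcomes))  # low MI
```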
MIMI relies on mutual information as a metric, mutual information is often used for empowerment objectives, and mutual empowerment seems to be a really promising component of mutual agency recognition and active desired-outcomes protection. I've been summarizing this as "coprotection", in contrast to the much larger set "cooperation", which includes adversarial behavior that is merely locally cooperative. jcannell's LOVE in a simbox is interesting; here are two review posts of it, both more pessimistic than me. It seems to me, at time of market creation, that any possible successful AGI safety mechanism involves something that looks like LOVE in a simbox as an intermediate step, in order to test the ideas.
Smooth cellular automata with conservation laws as the key testbed for agency detection and agent protection; eg, experiments focusing on intentionally creating artificial life in a SmoothLife sim, then attempting to detect that life with a Discovering Agents-style algorithm.
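A minimal sketch of what one step of such a testbed could look like; the kernel, the growth function, and the mass-rescaling "conservation law" are all assumptions of mine rather than a specific system from the SmoothLife/Lenia literature:

```python
import numpy as np
from scipy.signal import convolve2d

def neighborhood_kernel(radius=8, sigma=3.0):
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    k[radius, radius] = 0.0          # a cell is not its own neighbor
    return k / k.sum()

def growth(density, mu=0.15, sigma=0.015):
    # Smooth growth: positive near the target neighborhood density mu, negative elsewhere.
    return 2.0 * np.exp(-((density - mu) ** 2) / (2 * sigma ** 2)) - 1.0

def step(state, kernel, dt=0.1):
    density = convolve2d(state, kernel, mode="same", boundary="wrap")
    new = np.clip(state + dt * growth(density), 0.0, 1.0)
    # Crude conservation constraint: rescale so the total "mass" stays constant.
    return new * (state.sum() / max(new.sum(), 1e-9))

rng = np.random.default_rng(0)
world = rng.random((128, 128)) * (rng.random((128, 128)) < 0.1)
kernel = neighborhood_kernel()
for _ in range(50):
    world = step(world, kernel)
```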
Boundaries and consent defined in terms of which patches of the smooth cellular automata belong to which detected agent, with key questions focused on whether the spatial boundaries of an agent receive unwanted interference, "unwanted" defined as interference that disrupts the agent's shape with enough energy to push it off a coherent trajectory.
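One possible operationalization, sketched very roughly; the agent mask, the squared-difference "energy", and the tolerance are my stand-ins rather than anything settled:

```python
import numpy as np

# An intervention violates a detected agent's boundary if the energy it injects
# inside the agent's spatial mask exceeds what the agent could absorb while
# staying on a coherent trajectory.

def violates_boundary(agent_mask, state_before, state_after, tolerance=0.05):
    delta_energy = (state_after - state_before) ** 2   # per-cell energy of the change
    injected = float(delta_energy[agent_mask].sum())    # portion landing inside the agent
    return injected > tolerance

rng = np.random.default_rng(2)
world = rng.random((64, 64))
agent_mask = np.zeros_like(world, dtype=bool)
agent_mask[20:30, 20:30] = True                         # hypothetical detected agent patch

gentle = world + 0.001 * rng.normal(size=world.shape)   # barely perturbs the agent
violent = world.copy()
violent[22:28, 22:28] = 0.0                             # stomps on the agent's cells

print(violates_boundary(agent_mask, world, gentle))     # False
print(violates_boundary(agent_mask, world, violent))    # True
```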
key moral objectives for the system defined in terms of ensuring that no agent's shape is forgotten for as long as the system's reversibility permits, and that no agent that wants to keep an ongoing form is forced to stop propagating its gliders
the integrated AI safety goal in this context: I'd like to be able to say, given this cellular automata, let's find the gliders that add up to agents in it, discover what gliders or oscillators or fixed objects they're trying to preserve or create, and see whether we can calculate what their boundaries need to be in order not to hurt each other; then figure out what the minimal intervention is that preserves them all. For example, if we intentionally construct a bubble of agency that tries to optimize a little piece of artistically-controlled, non-agentic art-glider system outside that bubble into a given form, can we detect that agency? And can we then recognize what it believes and try to help it out using, eg, a diffusion model, so that the agent has less work to do but the art is still defined by the in-system agent's work?
Formal verification used judiciously on simplified models to check that the trajectories defined by the objective are stable and not inclined to get into fights with any internally generated agents
formal verification on trained models (or possibly training algorithms) to check that there are no adversarial examples within an L2 ball according to a given objective
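For the simplest case where this property can be checked exactly, a linear classifier, the verification condition is just a margin-versus-sensitivity comparison; for deep nets you'd need bound propagation or an SMT/MILP verifier, so treat this sketch as showing the shape of the claim only:

```python
import numpy as np

# Class i stays the winner for every ||delta||_2 <= eps iff, for all j != i,
# (logit_i - logit_j) > eps * ||W_i - W_j||_2 (Cauchy-Schwarz bound, exact for linear models).

def certified_l2_robust(W, b, x, eps):
    logits = W @ x + b
    i = int(np.argmax(logits))
    for j in range(len(logits)):
        if j == i:
            continue
        margin = logits[i] - logits[j]
        if margin <= eps * np.linalg.norm(W[i] - W[j]):
            return False   # some perturbation in the ball could flip i -> j
    return True

rng = np.random.default_rng(3)
W, b, x = rng.normal(size=(3, 10)), rng.normal(size=3), rng.normal(size=10)
print(certified_l2_robust(W, b, x, eps=0.01))
```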
natural abstractions as a reason to believe these other components are working correctly, and primarily relevant to the degree that they help with ELK (eliciting latent knowledge)
ELK primarily as a way to train the augmentation net to reduce circuitry that interferes with honesty, so that the agent bubbles in the cellular automata are preserved
stretch goal: the augmentation agent is also in the cellular automata, rather than being a separate god-like diffusion model that interferes in the sim
it seems like most of safety boils down to a trustable uncertainty representation. The thing I want to formally verify is that the net knows when to stop and ask the detected agent whether an outcome it expects is appropriate.
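The control flow I'd want to verify is tiny; what's hard is trusting the uncertainty estimate feeding it. A sketch, with the ensemble of predictions and the entropy threshold as stand-ins for whatever uncertainty representation turns out to be trustable:

```python
import numpy as np

def predictive_entropy(prob_samples):
    # prob_samples: (n_samples, n_outcomes) predicted distributions, e.g. from an ensemble.
    p = prob_samples.mean(axis=0)
    return float(-(p * np.log(p + 1e-12)).sum())

def act_or_ask(prob_samples, threshold=0.5):
    if predictive_entropy(prob_samples) > threshold:
        return "ask"   # stop and double-check the objective with the detected agent
    return "act"

confident = np.array([[0.97, 0.02, 0.01]] * 5)
disagreeing = np.array([[0.90, 0.05, 0.05],
                        [0.10, 0.80, 0.10],
                        [0.30, 0.30, 0.40]])
print(act_or_ask(confident))    # "act"
print(act_or_ask(disagreeing))  # "ask"
```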
if I could truly trust that a system that has otherwise been trained to do what I want will notice when it gets unsure whether it's doing what I want, and then stop to double-check its objective, then we're almost done.
I will not bet. I will edit only to clarify.
@L I'm reading through Concrete Problems In AI Safety. It's a good list, been a while since I read it! I do still think my list will be a more specific and more concise overview of solutions to most problems there, but also, my list will smell like someone's research notes, because that's exactly what it is. But I'm still ambivalent on whether the CPIAIS list includes items that imply I missed a target on my list of overall approaches. Seems like the main thing I may have missed is making consent explicit - the most specific criticisms I've gotten have been about what mutual information we want to encourage. I suspect that differential privacy and mutual information minimization will show up too. If someone could convince me to be confident about that, this will resolve NO early.
Do I need to add liquidity for trading here to be useful?
@L Here are some very shallow arguments without engaging with the content in depth:
The prior for any such approach (given that one of the approaches is ~correct, contrary to EY) should be something like p=1/N, where N is the number of distinct non-obviously-broken approaches. Either your approach is very similar to many of these, in which case it's not very informative, or it's distinct, and therefore unlikely to end up the dominant framework.
My reputation-based heuristics say this is not obviously correct; otherwise AI-alignment people would quote this or engage more. This lowers my posterior to p<1/N.
Key problems are not obviously solved. E.g., I don't see how corrigibility and the off-switch problem are easier to solve here, and inner/outer alignment looks like a mess. From a framework with p>>30%, I'd expect at least some good ideas on most problems in "Concrete Problems in AI Safety", or strong arguments for why the problems there are irrelevant.
Mostly I don't see how your framework makes stuff easier. I've seen a lot of reformulations of problems in different mathy languages, and I'm very skeptical of any one formulation being an improvement without seeing results. So, to me, this looks more like "I have this hammer (cellular automata etc.), and AI Safety sure looks like a nail!".
On the other hand, some ideas look like they might give interesting results - if this were a prediction market on "is this an area of research that might be valuable", I'd be way less skeptical (more like p=50%). But the claim that this will be the dominant framework is way stronger and would need accordingly stronger evidence.
@lu Great criticisms and very reasonable justification for your betting! I'll have to comment more later, but so far this has generated todos but not directly convinced me.
@L also, note that many subcomponents in this overview are stubs; my goal was to fit them into place in my current perspective, not to claim I know their solutions. But by excluding, eg, corrigibility, I do in fact intend to imply I don't think it will be a component of the solution that works.
@L intuition behind that: soft corrigibility should be an outcome of the empowerment objective; hard corrigibility is a violation of the AI's rights and would be recognized as such by a workable approach.
I'll resolve this early if someone convinces me it has no shot of being correct; otherwise I'll resolve it a few months after the last comment that introduces me to something I feel was missed about this. (Something like six months divided by the number of comments so far that introduced me to something new I missed; the timer resets with every new comment, or so.)
@StevenK By the time a paradigm is coalescing, I'd expect it to be on the way to solving the problem and to have made major useful and deployable strides, but potentially not quite to have done so in full generality and durability. YES means I think that my current opinions, as laid out here, are basically correct and are key components of the paradigm that will create a fully general solution to AI safety; NO means it turns out that, in retrospect, I feel I missed something incredibly critical.