
Note: I expect these ideas to be integrated into a coherent paradigm by 2026; I've extended the market until 2040 only to provide padding. The market resolves once I can discuss this with more academically respected safety nerds and be told whether there is a coherent main paradigm in safety, and whether or not, according to them, it is well represented by this list of components.
Causal-intervention-based definition of agency: follow-up papers from the https://causalincentives.com/ group and similar work, especially "Discovering Agents" and "Interpreting Systems as Solving POMDPs".
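As a toy illustration of the intervention flavor of that work (not the actual algorithm from "Discovering Agents"): treat a subsystem as agent-like if its policy changes when you intervene on the mechanism that links its actions to outcomes. The thermostat-vs-rock example below is made up for illustration.

```python
import numpy as np

# Toy intervention test: something is agent-like if its policy changes when we
# intervene on the mechanism linking actions to outcomes.

def outcome(action, heater_sign):
    # environment mechanism: how temperature responds to the heater action
    return heater_sign * action

def adaptive_policy(heater_sign, target=1.0):
    # an "agent": picks whichever action best achieves the target, so its
    # decision rule depends on the mechanism it is embedded in
    candidates = np.linspace(-2.0, 2.0, 401)
    return candidates[np.argmin((outcome(candidates, heater_sign) - target) ** 2)]

def fixed_policy(heater_sign, target=1.0):
    # a "rock": same action no matter how the mechanism behaves
    return 1.0

def adapts_to_mechanism_intervention(policy):
    baseline = policy(heater_sign=+1.0)
    intervened = policy(heater_sign=-1.0)  # do(): flip the action->outcome mechanism
    return not np.isclose(baseline, intervened)

print("adaptive policy flagged as agent:", adapts_to_mechanism_intervention(adaptive_policy))  # True
print("fixed policy flagged as agent:   ", adapts_to_mechanism_intervention(fixed_policy))     # False
```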
Mutual information, or a refined form of a mutual empowerment metric, as a key representation of what friendliness boils down to. E.g., Mutual-Information Maximizing Interfaces (MIMI) or a follow-up as a key component of what it is we want a general AI to do with other agents in the first place. The key insight driving MIMI is that a user's commands should become less noisy and more correlated with outcomes as the user becomes satisfied with the results of their actions; a minimal sketch of that measurement follows after the next item.
MIMI relies on mutual information as a metric; mutual information is often used for empowerment objectives; and mutual empowerment seems like a promising component of mutual agency recognition and active protection of desired outcomes. I've been summarizing this as "coprotection", in contrast to the much larger set "cooperation", which includes adversarial behavior that is merely locally cooperative. jcannell's LOVE in a simbox is interesting; here are two review posts of it, both more pessimistic than me. It seems to me, at the time of market creation, that any possible successful AGI safety mechanism involves something that looks like LOVE in a simbox as an intermediate step, in order to test the ideas.
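A minimal sketch of the measurement that MIMI-style insight rests on, assuming commands and outcomes can be binned and mutual information estimated from co-occurrence counts; the data here is synthetic.

```python
import numpy as np

# Synthetic MIMI-style check: estimate I(commands; outcomes) from binned samples
# and verify that cleaner command signals carry more information about outcomes.

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X; Y) in nats from paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
outcomes = rng.normal(size=5000)
noisy_commands = 0.2 * outcomes + rng.normal(size=5000)   # early, frustrated user
clean_commands = outcomes + 0.2 * rng.normal(size=5000)   # satisfied user

print("MI(noisy commands; outcomes):", mutual_information(noisy_commands, outcomes))
print("MI(clean commands; outcomes):", mutual_information(clean_commands, outcomes))
# the MIMI-style claim is that the second number should be noticeably larger
```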
Smooth cellular automata with conservation laws as the key testbed for agency detection and agent protection. E.g., experiments focused on intentionally creating artificial life in a SmoothLife sim, then attempting to detect that life with a discovering-agents algorithm.
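For concreteness, a minimal SmoothLife-flavored update step: a continuous field, disc-shaped inner and outer neighborhoods computed by FFT convolution, and a smooth birth/death band. The constants and transition function are simplified placeholders rather than the published SmoothLife rules, and this sketch skips the conservation-law variant entirely.

```python
import numpy as np

# SmoothLife-flavored continuous CA step; constants are illustrative only.

N, r_inner, r_outer = 128, 3.0, 9.0
yy, xx = np.mgrid[:N, :N]
dist = np.hypot(np.minimum(xx, N - xx), np.minimum(yy, N - yy))  # toroidal distance from (0, 0)

inner = (dist <= r_inner).astype(float)
ring = ((dist > r_inner) & (dist <= r_outer)).astype(float)
inner_f = np.fft.fft2(inner / inner.sum())
ring_f = np.fft.fft2(ring / ring.sum())

def step(state, dt=0.1, birth=(0.28, 0.37), death=(0.27, 0.45), alpha=0.03):
    s_f = np.fft.fft2(state)
    m = np.real(np.fft.ifft2(s_f * inner_f))  # inner ("cell") filling
    n = np.real(np.fft.ifft2(s_f * ring_f))   # outer ("neighborhood") filling
    sig = lambda x, a: 1.0 / (1.0 + np.exp(np.clip(-(x - a) / alpha, -60.0, 60.0)))
    lo = birth[0] * (1 - m) + death[0] * m    # thresholds blend with local aliveness
    hi = birth[1] * (1 - m) + death[1] * m
    alive = sig(n, lo) * (1 - sig(n, hi))     # ~1 when n sits inside the [lo, hi] band
    return np.clip(state + dt * (2 * alive - 1), 0.0, 1.0)

rng = np.random.default_rng(0)
state = rng.random((N, N)) * (dist < 20)      # seed a random blob
for _ in range(100):
    state = step(state)
```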
Boundaries and consent defined in terms of which patches of the smooth cellular automaton belong to which detected agent, with the key questions focused on whether the spatial boundary of an agent receives unwanted interference, where "unwanted" means interference that disrupts the agent's shape with enough energy to push it off a coherent trajectory.
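One way that could be operationalized, assuming we already have a mask of cells attributed to an agent and can roll out the CA with and without the outside event; the function name and thresholds are hypothetical placeholders.

```python
import numpy as np

# Sketch of an interference check over two counterfactual rollouts of the CA.

def interference_report(states_base, states_perturbed, agent_mask,
                        energy_eps=1e-3, divergence_eps=0.05):
    """states_*: arrays of shape (T, H, W); agent_mask: boolean (H, W)."""
    patch_base = states_base[:, agent_mask]       # (T, n_cells) inside the boundary
    patch_pert = states_perturbed[:, agent_mask]

    # energy the event injects into the agent's patch, per timestep
    injected = ((patch_pert - patch_base) ** 2).sum(axis=1)

    # how far the patch is pushed off its unperturbed trajectory
    divergence = np.linalg.norm(patch_pert - patch_base, axis=1) / (
        np.linalg.norm(patch_base, axis=1) + 1e-12)

    return {
        "energy_injected": float(injected.sum()),
        "max_divergence": float(divergence.max()),
        "unwanted": bool(injected.sum() > energy_eps and divergence.max() > divergence_eps),
    }

# toy usage with random stand-in data
T, H, W = 50, 64, 64
rng = np.random.default_rng(0)
base = rng.random((T, H, W))
perturbed = base + 0.1 * rng.random((T, H, W))
mask = np.zeros((H, W), dtype=bool)
mask[20:30, 20:30] = True
print(interference_report(base, perturbed, mask))
```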
Key moral objectives for the system defined in terms of ensuring that no agent's shape is forgotten for as long as the system's reversibility permits, and that no agent that wants to keep an ongoing form stops propagating its gliders.
The integrated AI safety goal in this context is, I'd like to be able to say: given this cellular automaton, let's find the gliders that add up to agents in it, discover what gliders, oscillators, or fixed objects they're trying to preserve or create, and see whether we can calculate what their boundaries need to be in order not to hurt each other; then figure out the minimal intervention that preserves them all. For example, if we intentionally construct a bubble of agency that tries to optimize a little piece of artistically controlled, non-agentic glider art outside that bubble into a given form, can we detect that agency? And can we then recognize what it believes and try to help it out using, e.g., a diffusion model, so that the agent has less work to do but the art is still defined by the in-system agent's work?
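The loop I have in mind, as a structural sketch only; every callable passed in below is hypothetical and names an open problem rather than an existing tool.

```python
from dataclasses import dataclass, field

# Structural sketch of the integrated loop; all injected callables are hypothetical.

@dataclass
class DetectedAgent:
    cells: object                                   # patch of the CA attributed to this agent
    preserved: list = field(default_factory=list)   # gliders / oscillators it wants kept alive

def coprotective_step(ca_state, discover_agents, infer_preserved_structures,
                      compute_boundaries, minimal_intervention,
                      is_unsure, ask_agents, apply_intervention):
    """One pass of: detect the agents, respect their boundaries, minimally help."""
    agents = discover_agents(ca_state)                    # e.g. a causal-intervention test
    for agent in agents:
        agent.preserved = infer_preserved_structures(agent, ca_state)
    boundaries = compute_boundaries(agents, ca_state)     # consent-respecting patches

    # smallest outside nudge that keeps every agent's structures propagating
    plan = minimal_intervention(agents, boundaries, ca_state)

    if is_unsure(plan):                  # ties into the uncertainty component below
        plan = ask_agents(agents, plan)  # defer to the in-system agents
    return apply_intervention(ca_state, plan)
```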
Formal verification used judiciously on simplified models to check that the trajectories defined by the objective are stable and not inclined to get into fights with any internally generated agents.
Formal verification on trained models (or possibly on training algorithms) to check that there are no adversarial examples within an L2 ball according to a given objective.
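A minimal example of the flavor of statement I mean, assuming a feedforward ReLU net: bound the network's L2 Lipschitz constant by the product of its layers' spectral norms, then check that the logit margin at a point is larger than anything a perturbation of norm ε could close. This is a very loose certificate, not real formal verification, but it is the same kind of claim.

```python
import numpy as np

# Cheap certificate that no adversarial example exists within an L2 ball of
# radius eps around x, for a feedforward ReLU classifier. ReLU is 1-Lipschitz,
# so the net's L2 Lipschitz constant is at most the product of the weight
# matrices' spectral norms; if the logit margin exceeds 2 * L * eps, no
# perturbation of norm <= eps can flip the prediction. Very conservative.

def forward(weights, biases, x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

def lipschitz_upper_bound(weights):
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def certified_robust_at(weights, biases, x, eps):
    logits = np.sort(forward(weights, biases, x))
    margin = logits[-1] - logits[-2]
    return margin > 2.0 * lipschitz_upper_bound(weights) * eps

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.2, size=(16, 8)), rng.normal(scale=0.2, size=(3, 16))]
biases = [np.zeros(16), np.zeros(3)]
x = rng.normal(size=8)
print("certified within L2 ball of radius 0.05:",
      certified_robust_at(weights, biases, x, eps=0.05))
```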
Natural abstractions as a reason to believe these other components are working correctly, and primarily relevant to the degree that they help with ELK.
ELK primarily as a way to train the augmentation net to reduce circuitry that interferes with honesty, so that the agent bubbles in the cellular automaton are preserved.
Stretch goal: the augmentation agent is itself inside the cellular automaton, rather than being a separate god-like diffusion model that intervenes in the sim from outside.
It seems like most of safety boils down to a trustable uncertainty representation. The thing I want to formally verify is that the net knows when to stop and ask the detected agent whether an outcome it expects is appropriate.
If I could truly trust that a system otherwise trained to do what I want will notice when it becomes unsure whether it's doing what I want, and then stop to double-check its objective, then we're almost done.
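A minimal sketch of the behavior I'd want to verify, with ensemble disagreement standing in for the trustable uncertainty representation (which is the actual open problem); ask_the_agent is a hypothetical hook for deferring to the detected agent or user.

```python
import numpy as np

# Uncertainty-gated deferral: act on the ensemble's favorite action unless the
# ensemble is too unsure, in which case stop and double-check the objective.

def predictive_entropy(prob_samples):
    """Entropy (nats) of the ensemble-averaged predictive distribution."""
    p = np.clip(prob_samples.mean(axis=0), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def act_or_ask(prob_samples, actions, ask_the_agent, entropy_threshold=0.5):
    if predictive_entropy(prob_samples) > entropy_threshold:
        return ask_the_agent(actions)          # stop and double-check
    return actions[int(np.argmax(prob_samples.mean(axis=0)))]

actions = ["stack the blocks", "knock over the blocks"]
confident = np.array([[0.97, 0.03], [0.95, 0.05], [0.98, 0.02]])
conflicted = np.array([[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]])

ask = lambda options: f"asking: did you mean {options[0]!r} or {options[1]!r}?"
print(act_or_ask(confident, actions, ask))    # acts
print(act_or_ask(conflicted, actions, ask))   # asks
```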
I will not bet. I will edit only to clarify.