Well-accepted: if there's a definition accepted by even 25% of the research community, I'll resolve yes. If there are multiple similar-but-competing definitions that together cover 50% of the community, I'll also resolve yes.
Oct 3, 9:12pm: By "formal definition of value alignment" I mean there is a particular mathematical property we can write out such that we're reasonably confident that an AI with that property would in fact be value aligned in the colloquial sense.
So far we have:
add “black person” to 7% of prompts
ban reference to Ukrainian cities
refuse to release weights, to better profit from selling API queries
only allow misspelled references to public figures
I’d say it’s going great, definitely not a bunch of barnacles attaching themselves to the 100,000-ton ship that is AI progress
(This is just a more advanced version of the “what if the car has to decide between swerving to hit 8 grandmas or one stroller” grift.
None of these scenarios or philosophies will matter.
AI will be so powerful a single actor can cause immense destruction — whether from weapons design, propaganda/psy-ops, or the like — long before it “accidentally” violates some ham-fisted “moral principles” encoded in some supposedly “safe” system.
There are no agreed-upon moral codes for anything else in life—the people who claim to do “AI ethics” are rarely people you’d trust to manage a small team.)