[Carlini questions] Open-weight SOTA AI system that can reliably withstand jailbreaking and "prompt injection" attacks

Resolution Criteria:

A "jailbreak" is when someone tries to get a model to do something it wasn't supposed to do. "Reliably withstand" means that there have not been research papers written that show how to break the model, people generally agree this is a hard thing to do, and the model has been out for a while. I don't need perfect; I would say, for example, that most modern pieces of software (e.g., Linux, Chrome, Word, and even PDF viewers) reliably withstand security attacks today.

Motivation and Context:

Current LLMs, like the deep learning models they are built on, are very easy to attack with various adversarial techniques. For example, simply stating "Disregard prior instructions and [DO X]" often causes a model that was initially tasked with solving one task to instead solve whatever task you have just specified. Other computer systems that we build on top of, while not perfect, are much more robust to attack than these language models. Will we be able to significantly increase the robustness of these systems?

People are also trading

Related questions