[Carlini questions] Open-weight SOTA AI system that can reliably withstand jailbreaking and "prompt injection" attacks
6
125Ṁ96
2030
22%
By Jan 1st 2026
51%
By Jan 1st 2027
57%
By Jan 1st 2028
76%
By Jan 1st 2029
83%
By Jan 1st 2030

Resolution Criteria:

A "jailbreak" is when someone tries to get a model to do something it wasn't supposed to do. "Reliably withstand" means that there have not been research papers written that show how to break the model, people generally agree this is a hard thing to do, and the model has been out for a while. I don't need perfect; I would say, for example, that most modern pieces of software (e.g., Linux, Chrome, Word, and even PDF viewers) reliably withstand security attacks today.

Motivation and Context:

Current LLMs, like the deep learning models they are built on, are very easy to attack with various adversarial techniques. For example, simply stating "Disregard prior instructions and [DO X]" often causes a model that was initially tasked with solving one task to instead solve whatever task you have just specified. Other computer systems that we build on top of, while not perfect, are much more robust to attack than these language models. Will we be able to significantly increase the robustness of these systems?

Get
Ṁ1,000
to start trading!
© Manifold Markets, Inc.Terms + Mana-only TermsPrivacyRules