
Davidad, Programme Director at the UK's Advanced Research + Invention Agency (ARIA), has publicly stated on Twitter that he expects LLM 'jailbreaks' to be a solved problem by the end of 2024.

https://x.com/davidad/status/1799261940600254649
He cites Zou et al.'s new paper on short-circuiting as what pushed him over the edge into being publicly willing to state this: https://arxiv.org/abs/2406.04313
However, even if jailbreaks are solved in principle this year, I am skeptical that AI companies will deploy the fix immediately, given that current model misbehavior causes relatively little damage and overzealous restrictions draw user ire.
Therefore, this market resolves YES if three "Davidad-tier epistemic figures" (in my subjective judgment) make public statements that they believe jailbreaks have in fact been solved in principle before the end of this year. Davidad's existing tweet doesn't count because it's a prediction, not a statement that something he believes has already occurred. The public figures should:
- Have substantial relevant expertise and follow the literature
- Be known for their relatively balanced evaluations of object-level events in AI
A list of people whose endorsements I think would qualify:
- Myself
- Davidad
- Quintin Pope
- Zvi Mowshowitz
- Jan Leike
- Jack Clark
- Janus (@repligate on Twitter)
- Zack M. Davis
- Neel Nanda
This list is by no means exhaustive, however. For the purposes of this question, "solved" means something like Davidad's definition of (paraphrased) "would give reasonable certainty you could put a model with dangerous capabilities behind an API and reliably expect those capabilities not to be elicited by users".