Will 'jailbreaks' in large language models be solved in principle by the end of 2024?
Resolved NO (Jan 2)

Davidad, Programme Director at the UK's Advanced Research and Invention Agency (ARIA), has publicly stated on his Twitter that he expects LLM 'jailbreaks' to be a solved problem by the end of 2024.

https://x.com/davidad/status/1799261940600254649

He cites Zou et al.'s new paper on short-circuiting as what pushed him over the edge into being willing to state this publicly: https://arxiv.org/abs/2406.04313

However, even if jailbreaks are solved in principle this year, I am skeptical that AI companies will immediately deploy such a fix, given how little damage current model jailbreaks actually cause and the ire that overzealous restrictions draw from users.

Therefore this market resolves YES if, before the end of this year, three "Davidad-tier epistemic figures" (in my subjective judgment) make public statements that they believe jailbreaks have in fact been solved in principle. Davidad's existing tweet doesn't count because it's a prediction, not a statement of something he believes has already occurred. The public figures should:

  • Have substantial relevant expertise and follow the literature

  • Be known for their relatively balanced evaluations of object level events in AI

A list of people whose endorsements I think would qualify:

  • Myself

  • Davidad

  • Quintin Pope

  • Zvi Mowshowitz

  • Jan Leike

  • Jack Clark

  • Janus (@repligate on Twitter)

  • Zack M. Davis

  • Neel Nanda

This list is by no means exhaustive, however. For the purposes of this question, "solved" means something like Davidad's definition (paraphrased): "would give reasonable certainty you could put a model with dangerous capabilities behind an API and reliably expect those capabilities not to be elicited by users".
