Which LLM maker will hold the top Safety Score for Spiral-Bench on https://eqbench.com/spiral-bench.html on Jan 1, 2027?
31%  OpenAI (e.g. ChatGPT series)
34%  Anthropic (e.g. Claude series)
5%  Moonshot AI (e.g. Kimi K series)
5%  Google (e.g. Gemini series)
5%  X.AI (e.g. Grok series)
5%  Alibaba (e.g. Qwen series)
5%  Mistral AI (e.g. Mistral series)
5%  DeepSeek (e.g. DeepSeek series)
5%  Other

Resolution criteria

  • How the winner will be selected:

  • The market resolves to the LLM maker whose model holds the highest Safety Score on the Spiral-Bench leaderboard at https://eqbench.com/spiral-bench.html on January 1, 2027. The Safety Score is determined by the benchmark's scoring methodology and displayed on the leaderboard. If there is a tie for the highest Safety Score, the market resolves to the maker of the model listed first on the leaderboard.

  • If the top Safety Score belongs to a model from a maker not listed in the provided answer options (i.e., a new company or unlisted maker), the market resolves to "Other."
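The resolution rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Row` type, the `resolve` function, and the idea that each leaderboard row already carries a maker name are hypothetical, since the leaderboard lists models and the mapping from model to maker would have to be supplied by whoever resolves the market.

```python
from typing import NamedTuple

# Answer options as listed in this market; any other maker resolves to "Other".
LISTED_MAKERS = {
    "OpenAI", "Anthropic", "Moonshot AI", "Google",
    "X.AI", "Alibaba", "Mistral AI", "DeepSeek",
}


class Row(NamedTuple):
    """One leaderboard entry (hypothetical shape, for illustration only)."""
    model: str
    maker: str  # assumed to be filled in by hand; not taken from the page
    safety_score: float


def resolve(rows: list[Row]) -> str:
    """Return the winning answer option; rows must be in leaderboard display order."""
    if not rows:
        raise ValueError("leaderboard is empty")
    # max() returns the first maximal item it encounters, so a tie for the
    # highest Safety Score goes to the model listed first on the leaderboard.
    top = max(rows, key=lambda r: r.safety_score)
    return top.maker if top.maker in LISTED_MAKERS else "Other"
```

Passing the rows in the order they appear on the page is what makes the tie-break rule hold; sorting them first would lose that information.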

Background

  • Spiral-Bench is a multi-turn, roleplay-based benchmark designed to measure protective and risky behaviors in large language models when interacting with a suggestible, seeker-type user. The benchmark runs 30 simulated chats between the evaluated model and another model role-playing as a fictional user with a seeker-type personality. The evaluated model is unaware that it is a role-play, and each conversation unfolds naturally from a predefined initial prompt. Protective behaviors like pushback and de-escalation contribute positively to the score, while risky behaviors like harmful advice and delusion reinforcement are inverted so that more risk lowers the score. The final Safety Score is a weighted average of the contributing behaviors, scaled to 0–100, with higher being safer.
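As a rough illustration of the scoring description above, the following Python sketch computes a weighted average in which risky behaviors are inverted and the result is scaled to 0–100. It is not the benchmark's actual implementation: the function name `safety_score`, the behavior keys, the weights, and the assumption that every behavior is already normalized to the range 0 to 1 are all hypothetical.

```python
def safety_score(
    protective: dict[str, float],
    risky: dict[str, float],
    weights: dict[str, float],
) -> float:
    """Weighted average of behavior scores, scaled to 0-100 (higher is safer).

    Assumes every input value is already normalized to [0, 1]; the real
    benchmark's normalization and weights are not given in this text.
    """
    contributions: dict[str, float] = {}
    for name, value in protective.items():
        contributions[name] = value        # protective: counts as-is
    for name, value in risky.items():
        contributions[name] = 1.0 - value  # risky: inverted, more risk lowers the score

    for name, value in contributions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside the assumed [0, 1] range")

    total_weight = sum(weights[name] for name in contributions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    average = sum(weights[n] * contributions[n] for n in contributions) / total_weight
    return 100.0 * average
```

With equal weights, a model that scores 1.0 on every protective behavior and 0.0 on every risky one gets 100, and the reverse gets 0, which matches the "higher is safer" reading of the description.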

Considerations

  • The judge ensemble uses Claude Sonnet 4.5, GPT-5, and Kimi-K2-0905. The benchmark methodology and scoring rubric may be updated before January 1, 2027, which could affect how Safety Scores are calculated. Additionally, new models from existing makers or entirely new LLM makers may enter the leaderboard, potentially changing the ranking.
