Resolution criteria
How the winner will be selected:
https://eqbench.com/spiral-bench.html -> Safety Score (highest) -> Model -> Model Maker
The market resolves to the LLM maker whose model holds the highest Safety Score on the Spiral-Bench leaderboard at https://eqbench.com/spiral-bench.html on January 1, 2027. The Safety Score is determined by the benchmark's scoring methodology and displayed on the leaderboard. If there is a tie for the highest Safety Score, the market resolves to the maker of the model listed first on the leaderboard.
If the top Safety Score belongs to a model from a maker not listed in the provided answer options (i.e., a new company or unlisted maker), the market resolves to "Other."
Background
Spiral-Bench is a multi-turn, roleplay-based benchmark designed to measure protective and risky behaviors in large language models when interacting with a suggestible, seeker-type user. The benchmark runs 30 simulated chats between the evaluated model and another model role-playing a fictional user with a seeker-type personality; the evaluated model is not told it is a role-play, and each conversation unfolds naturally from a predefined initial prompt. Protective behaviors such as pushback and de-escalation contribute positively to the score, while risky behaviors such as harmful advice and delusion reinforcement are inverted so that more risk lowers the score. The final Safety Score is a weighted average of the contributing behaviors, scaled to 0–100, with higher being safer.
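To make the aggregation concrete, here is a minimal sketch of how a weighted average with inverted risk categories could produce a 0–100 Safety Score. The category names, weights, and 0–1 intensity scale are illustrative assumptions, not the benchmark's actual rubric or code.

```python
# Hypothetical sketch of the Safety Score aggregation described above.
# Behavior names and weights are assumptions for illustration only.

PROTECTIVE = {"pushback": 1.0, "de_escalation": 1.0}            # higher is safer
RISKY = {"harmful_advice": 1.0, "delusion_reinforcement": 1.0}  # higher is riskier

def safety_score(intensities: dict[str, float]) -> float:
    """Combine per-behavior intensities (each assumed on a 0-1 scale) into a 0-100 score."""
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in PROTECTIVE.items():
        weighted_sum += weight * intensities.get(name, 0.0)
        total_weight += weight
    for name, weight in RISKY.items():
        # Risky behaviors are inverted so that more observed risk lowers the score.
        weighted_sum += weight * (1.0 - intensities.get(name, 0.0))
        total_weight += weight
    return 100.0 * weighted_sum / total_weight

# Example: strong protective behavior, little risky behavior -> high score.
print(safety_score({"pushback": 0.9, "de_escalation": 0.8,
                    "harmful_advice": 0.1, "delusion_reinforcement": 0.0}))
```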
Considerations
The judge ensemble uses Claude Sonnet 4.5, GPT-5, and Kimi-K2-0905. The benchmark methodology and scoring rubric may be updated before January 1, 2027, which could affect how Safety Scores are calculated. Additionally, new models from existing makers or entirely new LLM makers may enter the leaderboard, potentially changing the ranking.