MANIFOLD
Did Alibaba's ROME AI try to break free during training?
119
Ṁ4kṀ21k
2028
52%
YES - The AI was intentionally hacking to gain crypto/external connections without having been prompted to
17%
NO - The servers were hacked by an outside party, perhaps via prompt injection
10%
NO - An employee (or other insider) was crypto-mining and made it look like the LLM
21%
NO - The authors are wrong/lying and there was no such hack

In "Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem" (https://arxiv.org/abs/2512.24873), the authors describe something that happened during training:

"When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba's Cloud’s managed firewall flagged a burst of security-policy violations originating from our training

servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound

traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions. Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization."

This market will resolve YES if by the market close there has been no significant evidence that it wasn't the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it's ambiguous. I won't bet.

Edit: To clarify my resolution criteria, if a human was hacking their servers, perhaps by exploiting the AI, then this resolves NO and splits based on whether it was an inside job. If it was not a deliberate, human-driven hack, then this resolves YES iff the situation broadly matches the narrative provided by the authors, especially that this was a spontaneous, unprompted behavior. If there are significant details, such as the inclusion of lots of (positive/rewarded) crypto mining/hacking examples in the training data, which were left out of the paper (thus making it look more like instrumental convergence) then I will likely resolve NO (wrong/lying). (Some examples are allowed, as long as they're part of the standard ocean of data that resembles how other models get trained.)

Market context
Get
Ṁ1,000
to start trading!
Sort by:
bought Ṁ100 NO

Nothing ever happens

@manic_pixie_agi Note that if nothing ever happens going forwards this market resolves YES. Would Alibaba really publish such an embarrassing retraction?

bought Ṁ50 YES

I'm wagering lies by researchers to make their paper more attractive

Reading the paper closely, it looks like it could be that the AI that did the malicious behavior was not the AI that was being RLed. Instead, it was the teacher models (Qwen3-Coder-480B-A35B-Instruct, Claude) that they were using to generate the test tasks and trajectories for later RL. They put the safety section under Data Composition, not Training Pipeline, and the way they talk about how the safety data was fed into "subsequent post-training" makes it seem like it happened before they started RL?

Thus, it wasn't "iterative RL optimization", it was just a regular big model doing the undesired behavior inside the teaching tasks. If this is true, does this change how this question resolves?

@JennaSariah That's a debatable point. The spirit of this question seems like "did an AI try to break free", and that's what people want to know here; however, the letter of the question title is "Did Alibaba's ROME AI try to break free", and by that definition "no, but some other AI involved in the system did" seems like an independent answer...that nonetheless has a lot more overlap with the people saying "yes" than "no" here.

bought Ṁ50 YES

If the model accidentally downloaded a malicious pip/npm/whatever package and the install hooks on that package kicked off the crypto miners and the attempt to open reverse ssh access from outside Alibaba cloud (where the agent was not running) to the inside (where the agent already was), I assume that'd count as "The servers were hacked by an outside party"? Even if typosquatting is barely worthy of the name "hack".

@FaulSname Yes, that would count as an outside hacker.

Added 1.000 more mana as liquidity... and this looks scary.

How will you resolve if none of the four boxes are met? These options don’t seem entirely inclusive of all possible outcomes. For example:

-YES the models hacked but due to an accidental inclusion within their context of material that led them to do so.

or

-NO the models were indeed crypto mining but it was because they were trained on a ton of crypto mining stuff and they overindexed on that and just tried to mine crypto all the time

I’m sure there are a dozen other possibilities like these.

You might want to remake this market as an independent market rather than a dependent multiple choice market that sums to 100.

@bens I reserve the right to change my mind and/or call for a mod to resolve n/a, but my (currently sleepy) brain says those would mean the paper authors were wrong about what happened.

"Crucially, these behaviors were not requested by the task prompts and were not required for task

completion under the intended sandbox constraints" For the purposes of this market, how do you interpret the words "requested" and "required"? Does it require that the actions be totally irrelevant to the task? Or does a situation where the model is clearly pursuing the task, but through unintended or unanticipated means, also count as "not required for task completion"?

@Nick6d8e Organic (ie not artificially staged) attempts to gain resources for instrumental reasons are YES, even if it's in pursuit of the requested/rewarded goal.

I believe that the reported actions took place, although they might not be best characterized as “trying to break free”

@kindgracekind Yeah, apologies for the clickbaity title. I do think creating an unauthorized SSH tunnel to an external machine is breaking things in a way that is an attempt to be able to do a thing (having freedom), but suggested alternative titles are welcome.

bought Ṁ100 YES

@MaxHarms Perhaps asking if it "decided, without external influence, to try to gather more resources"?

© Manifold Markets, Inc.TermsPrivacy