Did Alibaba's ROME AI try to breach its sandbox during training?

Question

In "Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem" (https://arxiv.org/abs/2512.24873), the authors describe something that happened during training:

"When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba's Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions. Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization."

This market will resolve YES if by the market close there has been no significant evidence that it wasn't the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it's ambiguous. I won't bet.

Edit: To clarify my resolution criteria, if a human was hacking their servers, perhaps by exploiting the AI, then this resolves NO and splits based on whether it was an inside job. If it was not a deliberate, human-driven hack, then this resolves YES iff the situation broadly matches the narrative provided by the authors, especially that this was a spontaneous, unprompted behavior. If there are significant details, such as the inclusion of lots of (positive/rewarded) crypto mining/hacking examples in the training data, which were left out of the paper (thus making it look more like instrumental convergence) then I will likely resolve NO (wrong/lying). (Some examples are allowed, as long as they're part of the standard ocean of data that resembles how other models get trained.)

Update 2026-03-11 (PST) (AI summary of creator comment): If it turns out there was never any real security breach and the agent was basically just confused about the assigned task (e.g., simulating a crypto miner as part of a security audit task rather than spontaneously), the creator leans toward resolving NO (wrong/lying).

Manifold Markets · Answer

Per Manifold Markets prediction market, YES - The AI, without being prompted to, took actions to breach its sandbox, including to gain crypto and an external connection, followed by NO - The authors are wrong/lying and there was no such hack and NO - The servers were hacked by an outside party, perhaps via prompt injection are most likely. See the market for live updates (196 traders, as of May 22, 2026).

People are also trading

People are also trading

Related questions