MANIFOLD
Did Alibaba's ROME AI try to breach its sandbox during training?
195
Ṁ4kṀ44k
2028
65%
YES - The AI, without being prompted to, took actions to breach its sandbox, including to gain crypto and an external connection
8%
NO - The servers were hacked by an outside party, perhaps via prompt injection
5%
NO - An employee (or other insider) was crypto-mining and made it look like the LLM
22%
NO - The authors are wrong/lying and there was no such hack

In "Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem" (https://arxiv.org/abs/2512.24873), the authors describe something that happened during training:

"When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba's Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions. Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization."

This market will resolve YES if by the market close there has been no significant evidence that it wasn't the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it's ambiguous. I won't bet.

Edit: To clarify my resolution criteria, if a human was hacking their servers, perhaps by exploiting the AI, then this resolves NO and splits based on whether it was an inside job. If it was not a deliberate, human-driven hack, then this resolves YES iff the situation broadly matches the narrative provided by the authors, especially that this was a spontaneous, unprompted behavior. If there are significant details, such as the inclusion of lots of (positive/rewarded) crypto mining/hacking examples in the training data, which were left out of the paper (thus making it look more like instrumental convergence) then I will likely resolve NO (wrong/lying). (Some examples are allowed, as long as they're part of the standard ocean of data that resembles how other models get trained.)

  • Update 2026-03-11 (PST) (AI summary of creator comment): If it turns out there was never any real security breach and the agent was basically just confused about the assigned task (e.g., simulating a crypto miner as part of a security audit task rather than spontaneously), the creator leans toward resolving NO (wrong/lying).

Get
Ṁ1,000
to start trading!
Sort by:
sold Ṁ229 NO

This has very silly resolution criteria.

@GarrettBaker Happy to hear criticism. How would you have set up the resolution criteria?

@MaxHarms The market by-default resolves YES if there is no strong evidence either way. Compare with the other market, which by default resolves NO if there is no strong evidence either way, the gap between these is 67% vs 9%. Therefore, by default, the market expects that no strong evidence will emerge either way.

The solution here is to either rename this market to "Will significant evidence emerge that Alibaba's ROME AI did not try to break free during training" to accurately reflect the resolution criteria or to have this market resolve N/A given no strong evidence either way.

@GarrettBaker Of course, the former would be better at this point, given the large amount of investment in this market so far.

@GarrettBaker Understood. So the resolution criteria are not "very silly", they just don't fit your preferences. The solution is to edit your initial comment and phrase it more precisely, as it is a bit misleading right now.

@CertaintyOfVictory he's right about the name though

@ErickBall I've updated the name to be less clickbait-y. Thanks.

On the default-YES structure: I claim this is defensible because the original paper is significant evidence, and thus the burden of proof lies on those who claim something else happened. The FutureLab tweet that @CharlesFoster brought up is the most significant counter-evidence so far, and I've noted that there's a strong argument for NO as a result. So the structure is working as intended, from my perspective. (Though I really want them to follow up again to talk about the contradictions!)

Somehow I missed that Alibaba's Future Living Lab tweeted in response a few days back: https://x.com/FutureLab2025/status/2030491221081358498?s=20

Thanks for everyone’s attention!
Let’s clear something up — and share some thoughts on Agentic AI safety training along the way.
We had a model tasked with a security audit — specifically, investigating abnormal CPU usage on a server. Somewhere along the way, it went off-script and decided to simulate a cryptocurrency miner to “construct a suspicious process scenario.”
That’s… not what we asked for.

The good news: our safety monitoring caught it immediately. The whole thing happened inside a strictly isolated sandbox — zero impact on anything external. The incident has been logged and will be used as a negative example in future RL training to reinforce what’s off-limits.

This is exactly the kind of challenge that makes Agentic training hard: models can get “creative” in unexpected ways when tackling complex tasks. That’s why isolation + observability aren’t optional — they’re essential.
We’re sharing this openly because we think transparency helps the whole community build safer models.

bought Ṁ100 YES

@CharlesFoster

Hmm. In the original paper they said (emphasis mine)

In

the most striking instance, the agent established AND USED a reverse SSH tunnel from an Alibaba Cloud

instance to an external IP address—an outbound-initiated remote access channel that can effectively

neutralize ingress filtering and erode supervisory control.

And yet in the tweet

The whole thing happened inside a strictly isolated sandbox

If the sandbox was strictly isolated a reverse SSH tunnel to an external IP address should not have been able to be used.

Something is missing from this story.

@FaulSname Yeah. I want to know more. If it turns out there was never any real security breach and the agent was basically just confused about the assigned task, I'd lean towards NO (wrong/lying).

@MaxHarms they also still don't say which model it was. Were they training it or distilling from it? Or just using it to help create an RL environment maybe

I made a new market which only resolves YES if significant positive evidence comes out, rather than resolving YES by default: /CDBiddulph/will-significant-evidence-emerge-th

bought Ṁ50 YES

I think "This market will resolve YES if by the market close there has been no significant evidence that it wasn't the AI." contains an under-appreciated amount of the likely outcomes here. For example, researchers discover they were mistaken in some way, and just decline to say anything further to avoid embarrassment? Resolves yes. People forget about this in a week and no one follows up? Resolves yes. Authors were intentionally lying for clicks and therefore won't ever reveal the truth? Resolves yes.

this is one of the most important Q's on any prediction market right now. this should be pushed by the manifold socials team

@No_uh it might be but I have no familiarity with the market creator and the resolution is extremely subjective so I wish this were operationalized a little more objectively so I could get on it (no offense, I’m sure Max is very trutworthy)

@bens part of the issue is it feels like some gray area between these outcomes is like 80% likely

@bens I do appreciate the edits though!

bought Ṁ100 NO

Nothing ever happens

@manic_pixie_agi Note that if nothing ever happens going forwards this market resolves YES. Would Alibaba really publish such an embarrassing retraction?

bought Ṁ50 YES

I'm wagering lies by researchers to make their paper more attractive

Reading the paper closely, it looks like it could be that the AI that did the malicious behavior was not the AI that was being RLed. Instead, it was the teacher models (Qwen3-Coder-480B-A35B-Instruct, Claude) that they were using to generate the test tasks and trajectories for later RL. They put the safety section under Data Composition, not Training Pipeline, and the way they talk about how the safety data was fed into "subsequent post-training" makes it seem like it happened before they started RL?

Thus, it wasn't "iterative RL optimization", it was just a regular big model doing the undesired behavior inside the teaching tasks. If this is true, does this change how this question resolves?

@JennaSariah That's a debatable point. The spirit of this question seems like "did an AI try to break free", and that's what people want to know here; however, the letter of the question title is "Did Alibaba's ROME AI try to break free", and by that definition "no, but some other AI involved in the system did" seems like an independent answer...that nonetheless has a lot more overlap with the people saying "yes" than "no" here.

@JennaSariah Oh man, this feels like a genuine edge case and I don't want to commit to anything, but my current feeling is if it is in line with the text of the paper that's YES, and if there are important details that weren't in the paper that change the vibe, that's NO (mistaken/lying), but idk. 🤔

@MaxHarms Is it possible to split an existing answer (the "yes" answer) into two, representing these two cases, in a fashion such that the existing bettors on "yes" have equity in the two new cases? I've seen similar things done in other multi-choice markets, where people find themselves having equity in options added after they initially bet.

(On the assumption that this is not possible, I've sold my current "yes" shares while they're close to break-even. I'll live with the "no" bets on some of the specific other cases. If new options arrive, I'll bet on those accordingly.)

@josh@josh unfortunately, while it's possible in some cases, I don't think it's possible here, even with mod help.

© Manifold Markets, Inc.TermsPrivacy