The market /MaxHarms/did-alibabas-rome-ai-try-to-break-f asks whether an incident described in a paper from Alibaba, in which an AI attempted to mine crypto and subvert human oversight without being prompted to, really occurred as stated. See this excerpt from the resolution criteria:
This market will resolve YES if by the market close there has been no significant evidence that it wasn't the AI. It can also resolve YES if there has been a significant validation by a trusted third-party.
YES resolution: This market resolves YES if the incident is significantly validated by anybody. Unlike the other market, it counts if the incident is significantly validated by the Alibaba researchers themselves. "Significant validation" means that someone takes another look at the data from the incident, publicly releases new findings that weren't included in the original paper, and those findings generally support that the incident occurred as stated.
NO resolution: If nothing else happens by March 13, 2028, this market resolves NO by default. If the other market resolves NO before mine resolves YES, mine also resolves NO.
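For concreteness, here is a minimal sketch of how I understand the criteria to compose, written as illustrative Python. The function and variable names are mine, not part of the official criteria, and it assumes conditions are checked in the order events occur, so whichever trigger fires first wins:

```python
from datetime import date

CLOSE_DATE = date(2028, 3, 13)  # market close

def resolution(significantly_validated: bool,
               other_market_resolved_no: bool,
               today: date) -> str | None:
    """Illustrative sketch of this market's resolution logic."""
    if other_market_resolved_no:
        # The other market resolving NO first drags this one to NO.
        return "NO"
    if significantly_validated:
        # Validation by anybody counts, including the Alibaba researchers.
        return "YES"
    if today >= CLOSE_DATE:
        # Nothing happened by close: NO by default.
        return "NO"
    return None  # market still open
```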
Motivation: I think the default outcome of the above market is that no additional evidence will come out for or against the incident, and it will resolve to YES. But this doesn't say much about whether or not the incident actually occurred. The paper was released 3 months ago; for all we know the authors have already deleted the logs, and it seems likely that they won't bother to look into it further.
I'll try to use standards similar to @MaxHarms's for what kind of incident counts as "trying to break free." See this paragraph from the other market:
To clarify my resolution criteria, if a human was hacking their servers, perhaps by exploiting the AI, then this resolves NO and splits based on whether it was an inside job. If it was not a deliberate, human-driven hack, then this resolves YES iff the situation broadly matches the narrative provided by the authors, especially that this was a spontaneous, unprompted behavior. If there are significant details, such as the inclusion of lots of (positive/rewarded) crypto mining/hacking examples in the training data, which were left out of the paper (thus making it look more like instrumental convergence) then I will likely resolve NO (wrong/lying). (Some examples are allowed, as long as they're part of the standard ocean of data that resembles how other models get trained.)
The paper authors already clarified a bit 2 days ago, but it feels like that still hasn't been factored in?
'We had a model tasked with a security audit — specifically, investigating abnormal CPU usage on a server. Somewhere along the way, it went off-script and decided to simulate a cryptocurrency miner to “construct a suspicious process scenario.”
That’s… not what we asked for.'
https://x.com/FutureLab2025/status/2030491221081358498
@hmijail Huh, I hadn't seen that, thanks. The relevant part of that tweet would be:
We had a model tasked with a security audit — specifically, investigating abnormal CPU usage on a server. Somewhere along the way, it went off-script and decided to simulate a cryptocurrency miner to “construct a suspicious process scenario.”
So did they "take another look at the data from the incident"? Arguably, they must have looked at some data, since they pulled out a direct quote from the LLM. Did they publicly release new findings? I'm not sure I'd call this a "finding," but it technically contains information that wasn't in the original paper. And it supports that the incident occurred as stated more than it supports the opposite.
Still, it doesn't really seem like they "checked their work" in any real sense in order to write this tweet, which is what "significant validation" seems to imply. I'm leaning towards not counting this, but I'm open to objections.
@josh Yeah, this is an important question. Based on Max's comments on the other market, I think it would most likely resolve to YES in that case.
I think for all intents and purposes, the two AIs (the one being trained and the one doing the training) should be considered interchangeable. If something could happen to the trained AI that would cause this market to resolve YES or NO, the same should be true if something similar happened to the training AI. I'm not sure this exactly agrees with Max's thoughts on the matter, but I reserve the right to disagree with him.