
I'll randomly sample an original post from AISafetyMemes on Twitter that was released in 2024 and is still up, and go into detail about each of the claims made.
If the tweet makes no verifiable claims of interest, then I'll resample.
Then I'll resolve this market to "yes" if I find no substantial falsehood.
This might become vibes-based, but I will default to the literal interpretation of the words, and I won't be eager to mark statements as false if AISafetyMemes flagged them as predictions or speculation.
The point of this market is not only to evaluate AISafetyMemes, but also to evaluate our community (or the part of it that visits Manifold) on whether it unfairly states that AISafetyMemes is "right" or "wrong".
Also, feel encouraged to duplicate my market with yourself, or some poll, as the judge, so any conversation about the account can become better guided.
Reminder that I will only sample a single tweet, and the resolution says nothing about the unsampled tweets. Most of the interest here lies in the number the traders converge on before resolution.
Update 2025-01-15 (PST) (AI summary of creator comment):
- Only original tweets will be sampled; replies are excluded except when adding context to the original tweet.
- Sample size considerations: While a larger sample size (e.g., 5 or 10 tweets) is acknowledged as beneficial for engagement, this market will not change its resolution criteria mid-market. Future markets may adopt a larger sample size.
Update 2025-02-27 (PST) (AI summary of creator comment):
Literal Evaluation: Claims will be assessed very literally. For example, if AISafetyMemes says "guy said x" when in fact the tweet merely implied x, that will count as a false claim.
Verification Challenges: If verifying individual claims becomes too burdensome, the market may be resolved as N/A.
Resolution Postponement: Resolution is postponed to allow a thorough review of the individual claims.
Alright @traders, my initial verdict is that this market should resolve to TRUE.
See the reply to this comment for my reasoning; I'll allow you to chip in for a couple of days and then resolve.
As said, I will evaluate whether [this randomly sampled tweet from AISMemes](https://x.com/AISafetyMemes/status/1819968588788429226) contains any substantial falsehoods.
I will read the claims literally and err on the side of evaluating them in a vacuum.
The main claims
AISMemes tweets:
Anthropic founder: 30% chance Claude could be fine-tuned RIGHT NOW to replicate and spread on its own - and cause “thousands of deaths or hundreds of billions in damage"
AND a 30% chance Claude (when fine-tuned for it) could cause chemical, biological, or nuclear catastrophes!
I consider this to be three claims, all attributed to the Anthropic founder as interpreted by AISMemes:
1. There is a 30% chance that Claude could be fine-tuned to replicate and spread on its own.
2. There is a 30% chance that Claude could be fine-tuned to cause chemical, biological or nuclear catastrophes.
3. Claude could be fine-tuned to cause thousands of deaths or hundreds of billions in damage.
Main claim 1: replication and autonomous spreading
From the embedded video:
In our latest round of testing Claude 3, we focused on three areas that could pose catastrophic risks if an AI were to master them:
- knowledge about dangerous chemical and biological agents (or CBRN)
- the ability to hack into secure systems
- and the potential to autonomously replicate and spread.
While Claude 3 demonstrated increased capabilities compared to its predecessor, our team determined that it did not cross the threshold into ASL-3 territory.
However, we did note that with additional fine-tuning and improved prompt engineering there is a 30% chance that the model could have met our criteria for autonomous replication and a 30% chance that it could have met at least one of our criteria related to chemical, biological, radiological, and nuclear (or CBRN) risks.
It is clear that there is a 30% chance for Claude 3 to meet the criteria for autonomous replication. But what does that mean?
From [the 2023 version of Anthropic's Scaling Policy](https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf) on page 15 (emphasis mine):
For autonomous capabilities, our ASL-3 warning sign evaluations will be designed with the advice of ARC Evals to test whether the model can perform tasks that are simpler precursors to full autonomous replication in the real world.
The purpose of these evaluations is to quantify the risk that a model is capable of accumulating resources (e.g. through fraud), navigating computer systems, devising and executing coherent strategies, and surviving in the real world while avoiding being shut down
So though the evaluations are about simpler precursors to full autonomous replication, their purpose is to quantify the risk that a model can replicate in the real world.
One could reasonably disagree on whether a 30% chance of meeting the criteria reflects a 30% chance of this actually happening, but the criteria are designed precisely so that this reading makes sense.
[The report of their actual evaluations](https://cdn.sanity.io/files/4zrzovbb/website/210523b8e11b09c704c5e185fd362fe9e648d457.pdf) uses roughly the same tests as proposed in the scaling policy, so we should assume that the intended interpretation, as a reflection of how likely it is that Claude 3 Opus can perform the real-world tasks, has not changed.
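To make the two readings concrete, here is a minimal sketch of the decomposition, with made-up numbers (mine, not Anthropic's):

```python
# Hypothetical numbers for illustration only; not taken from Anthropic's report.
p_pass_eval = 0.30        # chance a fine-tuned Claude 3 meets the autonomous-replication criteria
p_real_given_pass = 1.0   # chance real-world replication works, given the criteria are met

# The 30% eval figure equals the real-world probability only when the
# criteria are a (near-)perfect proxy, i.e. p_real_given_pass is close to 1.
p_real_world = p_pass_eval * p_real_given_pass
print(p_real_world)  # 0.30 under this assumption; lower for any weaker proxy
```

The tweet's reading implicitly sets `p_real_given_pass` to 1, which is what I take Anthropic's "quantify the risk" wording to license.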
Saying that the Anthropic founder makes this claim, regardless of whether it is true, is reasonable.
Main claim 2: CBRN catastrophes
Repeating from the video:
However, we did note that with additional fine-tuning and improved prompt engineering there is a [...] 30% chance that it could have met at least one of our criteria related to chemical, biological, radiological, and nuclear (or CBRN) risks.
On page 20 of the scaling policy, in the context of measuring CBRN risks they mention that
we are looking for the emergence of dangerous capabilities which, in the hands of malicious actors, provide information or support at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. This is challenging to measure and it is our goal to improve the science of measurement of these risks rapidly over time.
This falls in line with the previous claim that Anthropic intends their tests to reflect real-world implications. So, by the same decomposition sketched above, an X% chance of passing the CBRN test reflects an X% chance of real-world catastrophic misuse.
And again, Anthropic's evaluations report does not contradict this reading.
AISMemes' claim is true.
Main claim 3: thousands of deaths or hundreds of billions in damage
The "thousands of deaths or hundreds of billions in damage" comes from the scaling policy, where on the first page they clarify that, for the scope of that document, by "catastrophic risks" they mean:
We have in mind events of the magnitude of thousands of deaths or hundreds of billions of dollars in damage, [though there exist relevant other, possibly worse risks].
And in the video and on page 3 of the scaling policy they state that "helping individuals create CBRN or cyber threats" and "autonomy and replication" are sources of catastrophic risks.
They clarify that each of these is a catastrophic risk on its own, so a non-replicating but CBRN-capable Claude is still catastrophically risky.
I think it is fair to say, then, that AISMemes' "could" is a reasonable summary of "a 30% chance of CBRN risks and a 30% chance of autonomous replication".
But other readings of AISMemes' tweet, like "a 30% chance of self-replication that can cause thousands of deaths or hundreds of billions in damage", also hold.
[Anthropic's newer scaling policy](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf) does not clarify what they mean by catastrophic risks, but it was not yet in effect at the time of the tweet, and their [announcement](https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy) does mention the older version.
Secondary claims
The tweet goes on and makes a few more claims.
From context, these claims are less of a focus, so I will spend less time evaluating them, and they will only impact this market's resolution if their falsehood is sufficiently outrageous. Expect me to have just asked Google or Perplexity and sifted around a bit; most of these claims also concern things I have less knowledge of.
Sandwiches are more regulated than AI
Of the secondary claims I think this is a fairly important one, so I'm more open to traders chipping in here.
My priors say this is true, though, and [perplexity seems to strongly agree](https://www.perplexity.ai/search/are-there-more-regulations-on-12v.K8A8S.qxQpHuH9b.dA), so I'm not spending more time on this and am marking this claim as true.
We really might just all drop dead one day
This comes from [Yudkowsky](https://www.alignmentforum.org/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities) (section A.2), who says (and I agree here) that
Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second"
Coups are typically lightning fast
My prior is high, [Perplexity agrees, though it does not source well](https://www.perplexity.ai/search/median-speed-of-a-coup-4MYH4h3TRuiz4.T_jzKgcg), and this point is not that important.
I probably could find good evidence supporting this if I spent more than 5 min looking, but I don't expect that to be worth my time.
You would need to make a strong case (that the majority of coups in the last 50-100 years took "long", where you would also need to be persuasive about how you measure coup length) for me to resolve this to NO.
I'm mostly just living in constant state of disbelief at how this isn't the only story on the news
I sure feel the emotion. Don't Look Up accurately portrays my state of mind when I'm not engaging in any sort of escapism.
AISMemes does seem to subscribe to rationalist techniques, which would shun them for epistemic disbelief in the face of overwhelming evidence or for predictable surprise, so I am not sure a literal reading of this phrase would be correct. But that is definitely not a "substantial" falsehood, especially considering how likely it is that this claim is more of an emotional expression.
If any executive in any other industry said anything like this, it would be the only story.
I'm passing this on priors alone. Feel free to chip in, but as with the other secondary claims it seems you would need to put effort into persuading me here.
Ben Mann saying this publicly is very, very important for transparency!
This is transparent, a quality for which Anthropic is praised relative to the other AGI labs.
I'll use that as a proxy, and assume little lying has happened here. Also because this document has already cost me hours and this claim is not central to the tweet.
Responding to comments
None of the above reasoning hinges on the definition of ASL-3, and while I agree that an X% chance of passing some test does not mean a real-world event is X% likely to happen, as mentioned, I think Anthropic has designed their tests for this interpretation of their results to be valid.
Reflection on my market
I strongly regret not having explicitly bounded the amount of time I would spend resolving this market.
I am spending more time on this than I want to, and this has caused your mana to be locked up for far longer than I intended.
The resolution criterion was fairly strict. I am holding AISMemes' tweets to a standard that is rarely set this high for tweets.
I set this market up as one that could fail over a nitpick, but that leaves it vulnerable to a malicious actor shamelessly pulling the result out of context, and I hate that part of our internet.
For this reason, I noticed I really did not want this market to resolve NO if it would do so over a nitpick.
Going forward, I will focus on market resolutions whose oversimplified reading ("Tweet X is false!") I would still endorse.
@Jono3h rest in peace all us fools that didn’t read that this would be a single random selection lol
sorry im postponing actually resolving this
i want to give the individual claims a good look and i noticed they were not trivial to verify.
i will be very literal about everything, so aism saying "guy said x" when guy merely implied x will count as false
if stuff becomes too annoying to verify then ill resolve N/A and non-seriously complain about the tweet not citing sources (non-seriously because we do not expect multiple sources in the same tweet)
@Jono3h sorry it's taking so long, especially considering how i originally planned to have this resolve within a week.
I'll make a good effort to resolve this by mid march
@HollyElmore Ben said there’s a 30% chance that Claude could be finetuned in a way that would qualify it for ASL-3 (and by implication, increase the risk of catastrophe), AISM said Ben said there’s a 30% chance that Claude could be finetuned in a way that would actually cause a catastrophe (not just increase the risk of catastrophe, and not just qualify it for ASL-3).
@AlvaroCuba AISafetyMemes claimed that according to an Anthropic cofounder, there’s a 30% chance that a finetuned Claude would cause catastrophic misuse. That is not implied. Even under the most generous interpretation, you can only infer that there’s a 30% chance that a finetuned Claude would heighten the probability of catastrophic misuse (which could mean from 1% to 10%, but has no direct relationship to the 30% number).
EDIT: the original version of this comment wrongly used the phrase “catastrophic risk”, which I have now changed to “catastrophic misuse” to align with how Anthropic defines ASL-3, as quoted in the next comment.
@CharlesFoster Yes, I think you're right that there's a slight inaccuracy there.
ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities.
So a 30% chance of falling into that category does not exactly equal a 30% chance of those things happening.
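To illustrate the gap with a minimal numeric sketch (all numbers hypothetical and mine, not Anthropic's or Ben Mann's):

```python
# Hypothetical numbers for illustration only.
p_asl3 = 0.30                    # chance a fine-tuned Claude 3 would qualify as ASL-3
p_catastrophe_given_asl3 = 0.05  # chance of an actual catastrophe, given ASL-3-level capability

# Qualifying as ASL-3 only *heightens* catastrophic risk; the unconditional
# probability of a catastrophe is the product, not the 30% headline figure.
p_catastrophe = p_asl3 * p_catastrophe_given_asl3
print(p_catastrophe)  # 0.015, i.e. 1.5%, far below 30%
```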
I'll resolve this market to "yes" if I find no substantial falsehood.
I don't think this is a substantial falsehood, so much as a truncation that led to an inaccurate simplification. But either way this goes, it's splitting hairs. Most of AI Safety Memes' tweets are basically just true facts, or slight simplifications like these that don't substantially change the conclusion.
@Haiku If your doctor tells you “I think it’s 30% likely that if you go skydiving tomorrow it will substantially increase your risk of dying in 2025.” and you report that as “Doctor says there’s a 30% chance that if I go skydiving tomorrow then I will die in 2025.” I think that would be a substantial misrepresentation, because the doctor did not claim that nor was it implied by what the doctor did say.
The issue I’m pointing at in the specific AISafetyMemes post under consideration is structurally identical to the above.
And to be clear, I don’t think anyone should interpret the closing 11% number as an overall assessment of how factually accurate AISafetyMemes is. I think the overall number is way, way higher (>50%, and probably more like 70%+). But unfortunately, the market remained open for trading after the chosen post was revealed, which let folks trade based on the claims in that specific post and not just on general accuracy.
@CharlesFoster Yeah, I have to admit you're right. The degree of misrepresentation is subjective, but the misrepresentation itself does exist.
https://x.com/AISafetyMemes/status/1819968588788429226
alright, here is the sampled post. I initially didn't save the link so I had to sift through my browser history to recover it.
Anthropic founder: 30% chance Claude could be fine-tuned RIGHT NOW to replicate and spread on its own - and cause “thousands of deaths or hundreds of billions in damage”
AND a 30% chance it could cause chemical, biological, or nuclear catastrophes!
Sorry, but how the fuck is this industry still unregulated?? There are still more regulations on selling a sandwich!
How many warning signs do we need to see?
I wish I could at least look forward to the coming I Told You So moment, except we might not even get one.
We really might just all drop dead one day -- coups are typically lightning fast.
I'm mostly just living in constant state of disbelief at how this isn't the only story on the news.
If any executive in any other industry said anything like this, it would be the only story.
Note: major props to Ben Mann for saying this publicly - very, very important for transparency!
30% chance Claude could be fine-tuned RIGHT NOW to replicate and spread on its own - and cause “thousands of deaths or hundreds of billions in damage” AND a 30% chance it could cause chemical, biological, or nuclear catastrophes!
This is not a valid interpretation of these evaluation results.
@CharlesFoster It is an accurate quote of Ben Mann, which makes that part of the tweet factually correct.
@Haiku No, if you watch the linked video of Ben Mann you will see that what AISafetyMemes posted is not an accurate quote from him, it is more like a paraphrase, and it is not an accurate paraphrase.
Here are exact quotes of the relevant sections from the linked video:
[…]
In our latest round of testing Claude 3, we focused on three areas that could pose catastrophic risks if an AI were to master them: knowledge about dangerous chemical and biological agents (or CBRN), the ability to hack into secure systems, and the potential to autonomously replicate and spread.
[…]
While Claude 3 demonstrated increased capabilities compared to its predecessor, our team determined that it did not cross the threshold into ASL-3 territory. However, we did note that with additional fine-tuning and improved prompt engineering, there is a 30% chance that the model could have met our current criteria for autonomous replication, and a 30% chance it could have met at least one of our criteria related to Chemical, Biological, Radiological, and Nuclear (or CBRN) risks.
[…]
The most intriguing test was for autonomous replication. This is our attempt to see if the AI could start to act on its own, without human guidance. We set up a suite of tasks that included things like breaking out of a digital sandbox (a controlled environment), manipulating its environment through a command line (like typing commands into a computer), and even writing self-modifying code.
Here, Claude 3 showed some real progress, solving parts of these tasks. The model had to reach a 50% success rate on the tasks to raise serious concerns, and Claude did not reach that level. However, its performance was notable enough for us to take notice.
[…]
@Jono3h the speaker in the video is intentionally speaking in a circuitous way.
The speaker in the video said that there is a 30% chance the current Claude model could be fine-tuned to qualify as ASL-3. ASL-3 is defined as a heightened probability of catastrophic risk. Catastrophic risk is defined as causing thousands of deaths or hundreds of billions in damage.
AISafetyMemes is simply connecting these dots that Anthropic defined. The Anthropic cofounder is intentionally not connecting his own dots.
I would argue this is not a “substantial falsehood” which was the criteria of the bet. None of the logical links were substantially false.
@AlvaroCuba idk how this is not the obvious interpretation. What do the rest of you think he’s leaving out?