Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?
98 traders · Ṁ49k · closes Dec 31 · 93% chance

https://www.safe.ai/blog/forecasting

Any way in which the claim is not true in a common-sense way. Things that would resolve this YES (draft):

  • We look at the data and it turns out that information was leaked to the LLM somehow

  • The questions were selected in a way that chose easy questions

  • The date of forecast was somehow chosen so as to benefit the LLM

  • This doesn't continue working over the next year of questions (measured against the Metaculus crowd's accuracy over the last year, i.e. the crowd can't win simply by becoming more accurate)

  • The AI was just accessing forecasts and parroting them.

https://x.com/DanHendrycks/status/1833152719756116154


Right, okay, question seems to be how to resolve this.

If someone else wants to separately rerun the test, then I think with Halawi's work that would probably be enough for me.

We could enter it into the Metaculus AI forecasting tournament.

I almost think the Platt scaling alone is enough for a yes resolution.

Thoughts? In particular from @AdamK and @DanHendrycks if you want to push back.

Thanks everyone for the quick debunking here.

Our team at FutureSearch took our time and re-read all the papers making these claims this year. Here's our takedown: https://www.lesswrong.com/posts/uGkRcHqatmPkvpGLq/contra-papers-claiming-superhuman-ai-forecasting

@DanSchwarz If this isn't enough to resolve, I'm not sure what is.

Here is the current state of this discussion using votes from here and LessWrong. Seems like there is a lot we agree on.

bought Ṁ1,000 YES

https://twitter.com/dannyhalawi15/status/1833295067764953397

Thread finds much worse performance and names a few issues.

The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023. Findings: Their Brier score: .195 Crowd Brier score: .141

First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house. 1. This event happened at the end of October. 2. This also happens to be a question in the Metaculus dataset.
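For readers unfamiliar with the metric in the quoted thread: the Brier score is the mean squared error between probability forecasts and binary outcomes, so lower is better and always answering 50% scores 0.25. A minimal sketch (the numbers below are illustrative, not Halawi's data):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts (0..1) and
    binary outcomes (0 or 1). Lower is better; a constant 0.5 forecast
    scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# A sharper, well-calibrated forecaster scores lower:
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))  # 0.02
```

By this metric, the quoted re-run has the crowd (.141) beating the system (.195) by a wide margin.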

I made a poll to test the views of this comment section (and possibly lesswrong) so we can figure out ways to go forward. It takes 2 minutes to fill in.

https://viewpoints.xyz/polls/ai-forecasting

Do we want a new market on whether it will beat the crowd on future questions, somehow?

@NathanpmYoung I was thinking of exactly that - e.g. would it beat the Metaculus community prediction on the next Metaculus tournament - but at this point I think it wouldn't be interesting because it would be in the high 90s and mostly a discount rate question.

Here's a quote from the tool (it landed on 7% chance for the Bills and Chiefs to play in the Super Bowl)

Reflecting on the initial probability, it's important to consider the base rate of any two specific teams meeting in the Super Bowl. Historically, the probability of any two specific teams from the same conference meeting in the Super Bowl is quite low due to the number of variables and potential upsets in the playoffs. The Chiefs and Bills are both top contenders, but the AFC's competitiveness and the single-elimination format of the playoffs reduce the likelihood of both teams making it through. The initial probability of 0.08 (8%) seems reasonable given the strengths of both teams but also the inherent uncertainties and challenges they face. Considering the base rates and the specific strengths and challenges of the Chiefs and Bills, the final probability should be slightly adjusted to account for the competitive nature of the AFC and the playoff structure.

We're pretty sure that contamination from news articles was not an issue in our reported Metaculus evals. Here are guardrails we used:

  1. We inject before:{date} into search queries

  2. We use news search instead of standard search (this excludes websites like Wikipedia); news results are more tightly time-bound, and post-publication edits are clearly marked

  3. We publicly forked newspaper4k to check each article's updated time, not merely its created time, to make sure we filtered correctly on updated time

  4. Finally, in the released code base, before forecasting we validate the time again (and also reject all of the articles with unknown time)

As an extra check, we also had GPT-4o look over all of the articles we used for each Metaculus forecast, checking whether the publish date or the content of the article leaked information past the forecast date of the model. We also manually looked through several dozen examples. We could not find any instances of contamination from news articles provided to the model. We also couldn't find any instances of prediction market information being included in model sources.
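The filtering described in steps 3-4 can be sketched roughly as follows. This is a hypothetical reconstruction from the description above, not the released codebase (which uses a fork of newspaper4k); the dict shape is assumed for illustration:

```python
from datetime import datetime, timezone

def filter_articles(articles, cutoff):
    """Keep only articles whose publish time is known and whose publish
    AND last-updated times both strictly predate the forecast cutoff.
    Articles with unknown time are rejected outright."""
    kept = []
    for a in articles:
        published = a.get("published")
        if published is None:                    # unknown time: reject
            continue
        updated = a.get("updated") or published  # fall back to publish time
        if published < cutoff and updated < cutoff:
            kept.append(a)
    return kept

cutoff = datetime(2023, 11, 1, tzinfo=timezone.utc)
articles = [
    {"url": "a", "published": datetime(2023, 10, 1, tzinfo=timezone.utc), "updated": None},
    {"url": "b", "published": datetime(2023, 10, 1, tzinfo=timezone.utc),
     "updated": datetime(2024, 1, 5, tzinfo=timezone.utc)},   # edited after cutoff
    {"url": "c", "published": None, "updated": None},         # unknown time
]
print([a["url"] for a in filter_articles(articles, cutoff)])  # ['a']
```

Note that a filter like this is only as good as the date metadata it is given, which is exactly where the skeptics below focus.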

In light of the comments/scrutiny we've received, and the extra checks we did here, I'm much more confident in the veracity of our work. I commit to betting $1K mana on NO at market price over the next 24h to signal this. There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results. The fact that our forecasts are retroactive always opens up the possibility for issues like this, and the gold standard is of course prospective forecasting, but I think we've managed to sanity-check and block sources of error to a reasonable degree.

I'm not sure what standard this market will use to decide whether/how stuff like Platt scaling/scoring operationalizations might count as a "substantive issue," but I'm decently confident that the substance of our work will hold up to historical scrutiny: scaffolded 2024-era language models appear to perform at/above the human crowd level on the Metaculus distribution of forecasting questions, within the bounds of the limitations we have described (such as poor model performance close to resolution). We also look forward to putting a more polished report with additional results on the arxiv at some point.

Long has mentioned that he's happy to answer further questions about the system by email (to long@safe.ai), but we're not expecting to post further clarifications here in the interest of time.

@AdamK I think the whole "your paper fails to replicate" issue is pretty serious RE: 'historical scrutiny', but YMMV

@DavidFWatson Agreed. I look forward to seeing the prospective performance of the system in the coming weeks and months. I can’t speak to Halawi’s claims in particular, but clearly if our system fails to be comparable to the crowd level on new questions (filtered and scored in a manner similar to our evals, assuming the community finds our filtering/scoring choices reasonable), the framing of our claim would have been a mistake.

@AdamK I appreciate you being willing to bet here. If you would like to nominate a trusted mutual party as arbitrator, let me know.

@NathanpmYoung We may have a shared misunderstanding of the magnitude of his offer to bet. I thought he meant he will spend $1k USD, it seems he means 1000 mana, ie, $1 😂, which he has already bet since making this comment.

@DanM yes, I was clearly confused. I was ready to see this market just get absolutely dominated by that bet. Like the "Biden resigns" bet that's just stuck due to one guy's limit order

@AdamK

> There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results

This skepticism seems clearly true to me for ChatGPT-4o-Latest on poe.com. For example, I asked "Where in Nepal was there an earthquake in November 2023?" and it said:

"In November 2023, a significant earthquake struck western Nepal, specifically affecting the Jumla district and surrounding areas. The earthquake, which occurred on November 3, 2023, had a magnitude of 6.4. The tremors were felt across much of western Nepal and parts of neighboring India.

The most affected areas were in the Karnali Province, with districts like Jumla, Dolpa, and Surkhet experiencing notable damage. The earthquake resulted in the loss of lives, injuries, and destruction of property, with aftershocks continuing to cause concern in the days following the main event."

(Link to conversation)

Nov 3 2023, Karnali Province and Jumla are all correct (though the magnitude was 5.7) - the date in particular seems impossible to get without data leakage.

This seems very important: you can't ask it about prediction market questions if it may have been trained on the resolution. I do not think anyone can trust your results without checking for this on the exact 4o checkpoint you used. This may depend on the checkpoint, and I imagine you used the API rather than poe.com, so this may not generalise! poe.com also has a "GPT-4o" model which doesn't seem to know as much, so my guess is that it may have leaked in via post-training? (I ran out of free credits before I could do much testing.)

(Note, if relevant, that I am a NO holder, and think this market is probably overconfident, but it's very important to get this all right!)

@NeelNanda I appreciate your input here. We used gpt-4o 0513. It seems really tricky to determine the extent to which this may have affected retroactive evals, even from looking at the model reasoning traces.

@AdamK A simple test would be to ask it questions like the one I did above, about events in Nov 2023, and see what happens? (You may need some prompt engineering to stop it refusing when it sees Nov 2023 in the prompt, since the system prompt seems to tell it not to answer things post Oct 2023.)

https://x.com/dannyhalawi15/status/1833295067764953397
> The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023. Findings: Their Brier score: .195 Crowd Brier score: .141

(From Danny Halawi)

Seems pretty solid to me?

@AdamK Do you have a response here?

@RyanGreenblatt I'm pretty confused by Danny's results - under the hypothesis that there's eg some data leakage issue, shouldn't this affect Danny's results too? Which makes me wonder if there's some difference in the setups or something

@NeelNanda Unless the problem is data leakage but only from Nov 2023

The evaluation consists of asking the model about past events (so we know whether they happened and don't have to wait months for eval results) that came after the training cutoff. But they allow the model to search the internet today. They claim to use the search engine date cutoff feature, but that's entirely unreliable.

Just tried randomly searching rivers of blood elden ring before:2022-02-01 (game launched after that) and promptly got several results from 2024.

The decision to use Platt scaling with little justification is also quite problematic.

IMO, this should resolve YES already.

@DavidFWatson Credit for this insight goes to @pietrokc

@DavidFWatson @Bayesian I see you disagree?

bought Ṁ50 NO

no I trade the volatility ig

opened a Ṁ500 YES at 83% order

@DavidFWatson good explanation, I suspected contamination because they were just prompting GPT-4o and the demo had all the issues that LLM forecasters have (overestimation of low-end probabilities, lack of internal consistency) so was skeptical of the result

@DavidFWatson israel invades gaza before:2022-12-02

returns a CNN article from this year. Maybe Google relies on the dates that the site uses or something, so if a page is tagged with the wrong date it shows up in that search?

Regardless, agreed this should count as information leaking

Yes search engine date filtering is unreliable. However, if you grab the web archive version of the page from before the cutoff date, then you'd be fine. Unclear if they did this for the eval or not. There was an earlier comment thread that brought this up.
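One concrete way to do what this comment describes is the Wayback Machine's "available" API, which returns the snapshot closest to a given timestamp; a retrieved page can then be accepted only if its snapshot strictly predates the cutoff. A sketch (this is the public archive.org API, not anything from the paper's codebase):

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"

def snapshot_query(url, cutoff="20231101"):
    """Build the Wayback 'available' API request for the snapshot
    closest to `cutoff` (YYYYMMDD)."""
    return WAYBACK_API + "?" + urllib.parse.urlencode(
        {"url": url, "timestamp": cutoff}
    )

def safe_snapshot(response, cutoff="20231101"):
    """Given the API's JSON response, return the archived URL only if the
    snapshot strictly predates the cutoff; otherwise None (i.e. drop the
    article rather than risk leakage)."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("timestamp", "")[:8] < cutoff:
        return snap["url"]
    return None

# Fetching is one network call:
# with urllib.request.urlopen(snapshot_query("example.com")) as r:
#     archived = safe_snapshot(json.load(r))
```

From the discussion below, the released code does not appear to do anything like this.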

@jack I'm sorry, do they grab the web archive version of the page from before the cutoff date?

@DavidFWatson Agree this is my #1 concern too (I posted a similar thing earlier https://manifold.markets/NathanpmYoung/will-there-be-substantive-issues-wi#f01fyn9r5b8) but we don’t know if they used Google’s implementation of date cutoff do we?

@DavidFWatson To clarify, before seems to work correctly when used on the News tab of Google, which is what the demo uses

@DavidFWatson I'm not sure what you're asking, that's literally what I said

@jack I guess I'm asking that the event resolve YES based on the technical paper they already released, given that the way they're outlining avoiding contamination during backtesting doesn't seem to work, among other issues.

@AdamK news tab with before search operator still gives me recent results

@DavidFWatson At least the github code they posted does not seem to have any mentions of "archive" in it, and spending 10-15 minutes tracing the code path, I couldn't find anything that looked like accessing archived snapshots of webpages.
