https://www.safe.ai/blog/forecasting
In some way, the claim is not true in a common-sense way. Things that would resolve this YES (draft):
We look at the data and it turns out that information was leaked to the LLM somehow
The questions were selected in a way that chose easy questions
The date of forecast was somehow chosen so as to benefit the LLM
This doesn't continue working over the next year of questions (measured against the Metaculus crowd's accuracy over the last year, i.e. the crowd can't win just by getting more accurate over time)
The AI was just accessing forecasts and parroting them.
A recent report from Tetlock shows GPT-4 level models matching crowd median forecasts in a prospective setting: https://arxiv.org/abs/2409.19839. It's unclear how these humans compare to the Metaculus crowd, but it's still a noteworthy result.
@MantasMazeika The models which (almost) match the crowd median use freeze values (they are provided with the crowd forecast). It seems they didn't score the static freeze values on their own; I wonder how those would have compared.
The model used in SafeAI's forecaster seems quite bad according to the leaderboard referenced in that paper: LLM Data Table
SafeAI's model was GPT-4o (scratchpad with news), which performed even worse than GPT-4o (scratchpad)
- Superforecaster median score: 0.093
- Public median score: 0.107
- Claude-3-5-Sonnet-20240620 (scratchpad with freeze values): 0.111
- GPT-4o (scratchpad): 0.128
- GPT-4o (scratchpad with news): 0.134
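For reference, the scores above are Brier scores: the mean squared error between probability forecasts and binary outcomes, so lower is better and a constant 50% forecast scores 0.25. A minimal sketch:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.
    Lower is better: 0.0 is perfect; always guessing 50% scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Sharper, well-calibrated forecasts score better than hedged ones:
print(round(brier_score([0.9, 0.1, 0.8], [1, 0, 1]), 4))  # 0.02
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))            # 0.25
```

A gap like 0.093 vs 0.134 is therefore a meaningful accuracy difference, not rounding noise.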
@DanM A few points:
- They did not benchmark our FiveThirtyNine bot. The specific scaffolding used matters a lot, and their scaffolding is the prompt from Halawi et al. This same scaffolding significantly underperformed our scaffolding in our evaluation (0.1221 -> 0.0999 Brier score).
- Claude 3.5 Sonnet w/out freeze values (#12 in the leaderboard) is still within error of the crowd median forecast, although only barely. I agree that it would be interesting to see how the freeze values alone compare.
I haven't had time to dig closely into this study, but I wonder if this is enough to resolve:
https://x.com/metaculus/status/1839069315422646456
@Harlan Note that the Metaculus blog post actually shows that two bots matched the median pro forecasters in their evaluation, which seems to contradict their claim earlier in the post that "AI forecasting does not match ... the performance of experienced forecasters". Regarding how their post bears on our specific evaluation, they state that our "methodology has been highly contested" and link to a LW post by FutureSearch. That post itself doesn't provide new information, but rather repeats and links to several concerns that we fully addressed in our earlier response to Halawi.
Right, okay, question seems to be how to resolve this.
If someone else wants to separately rerun the test, then I think that, combined with Halawi's work, would probably be enough for me.
We could enter it into the Metaculus AI forecasting tournament.
I almost think the Platt scaling alone is enough for a yes resolution.
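For context, Platt scaling recalibrates raw probabilities by fitting two parameters (a, b) on resolved questions; the concern is that if those parameters are tuned on the same questions later scored, the reported accuracy benefits from hindsight a live forecaster wouldn't have. A minimal pure-Python sketch (the data and hyperparameters here are illustrative, not from the evaluation):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_platt(probs, outcomes, lr=0.1, steps=2000):
    """Fit p' = sigmoid(a * logit(p) + b) by gradient descent on log loss.
    If (a, b) are tuned on the questions later used for scoring, the
    reported Brier score uses information a live forecaster lacks."""
    a, b = 1.0, 0.0
    xs = [logit(p) for p in probs]
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = sigmoid(a * x + b) - y  # gradient of log loss w.r.t. z
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Illustrative: overconfident forecasts get pulled toward the base rate.
a, b = fit_platt([0.95, 0.9, 0.9, 0.1, 0.05], [1, 1, 0, 0, 0])

def rescale(p):
    return sigmoid(a * logit(p) + b)
```

Whether applying this counts as a "substantive issue" depends on whether the scaling parameters were fit in-sample or held fixed in advance.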
Thoughts? In particular from @AdamK and @DanHendrycks if you want to push back.
@NathanpmYoung Hey, just reaching out to point out our response to Halawi, which seems to not have been linked here yet: https://x.com/justinphan3110/status/1834719817536073992
It turns out their evaluation focused on questions that we explicitly mentioned in our blog post as limitations of the system. When evaluated on the Metaculus subset of their questions, the results actually support our initial claim.
Regarding the phrasing of the question, "substantive issues" strongly suggests that you would think our initial claim is incorrect if you decided that the question should resolve true. It's important to keep in mind what our claim actually is: matching crowd accuracy, subject to certain limitations described in the blog post. I don't think Halawi's post or other points that have been raised support this. Perhaps the question title and resolution criteria could be harmonized to address this.
Thanks everyone for the quick debunking here.
Our team at FutureSearch took our time, and re-read all the papers this year making these claims. Here's our takedown: https://www.lesswrong.com/posts/uGkRcHqatmPkvpGLq/contra-papers-claiming-superhuman-ai-forecasting
https://twitter.com/dannyhalawi15/status/1833295067764953397
Thread finds much worse performance and names a few issues.
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions, all opened after November 2023. Findings: their Brier score: 0.195; crowd Brier score: 0.141.
First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house. 1. This event happened at the end of October. 2. This also happens to be a question in the Metaculus dataset.
I made a poll to test the views of this comment section (and possibly lesswrong) so we can figure out ways to go forward. It takes 2 minutes to fill in.
@NathanpmYoung I was thinking of exactly that - e.g. would it beat the Metaculus community prediction on the next Metaculus tournament - but at this point I think it wouldn't be interesting because it would be in the high 90s and mostly a discount rate question.
Here's a quote from the tool (it landed on 7% chance for the Bills and Chiefs to play in the Super Bowl)
Reflecting on the initial probability, it's important to consider the base rate of any two specific teams meeting in the Super Bowl. Historically, the probability of any two specific teams from the same conference meeting in the Super Bowl is quite low due to the number of variables and potential upsets in the playoffs. The Chiefs and Bills are both top contenders, but the AFC's competitiveness and the single-elimination format of the playoffs reduce the likelihood of both teams making it through. The initial probability of 0.08 (8%) seems reasonable given the strengths of both teams but also the inherent uncertainties and challenges they face. Considering the base rates and the specific strengths and challenges of the Chiefs and Bills, the final probability should be slightly adjusted to account for the competitive nature of the AFC and the playoff structure.
We're pretty sure that contamination from news articles was not an issue in our reported Metaculus evals. Here are guardrails we used:
- We inject before:{date} into search queries.
- We use news search instead of standard search (this excludes websites like Wikipedia, etc.); news results are more time-bound, and post-publication edits are clearly marked.
- We publicly forked newspaper4k to look at the updated time of each article, not merely its created time, to make sure we correctly filtered based on updated time.
- Finally, in the released code base, we validate the time again before forecasting (and also reject all articles with unknown time).
As an extra check, we also had GPT-4o look over all of the articles we used for each Metaculus forecast, checking whether the publish date or the content of the article leaked information past the forecast date of the model. We also manually looked through several dozen examples. We could not find any instances of contamination from news articles provided to the model. We also couldn't find any instances of prediction market information being included in model sources.
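The guardrails described above amount to a filter like the following sketch (the function names and article fields are hypothetical; this illustrates the logic, not SafeAI's actual code):

```python
from datetime import datetime, timezone

def add_date_guard(query: str, forecast_date: datetime) -> str:
    """Append a before:{date} restriction to a news search query,
    mirroring the query injection described above."""
    return f"{query} before:{forecast_date.date().isoformat()}"

def is_usable(article: dict, forecast_date: datetime) -> bool:
    """Drop any article whose publish or last-updated time is unknown,
    or not strictly before the forecast date. Dict keys are hypothetical."""
    published = article.get("published")
    updated = article.get("updated") or published  # fall back to publish time
    if published is None or updated is None:
        return False  # unknown time -> reject, per the released-code guardrail
    return max(published, updated) < forecast_date

# Example: an article updated after the forecast date is rejected even
# though it was originally published before it.
cutoff = datetime(2024, 6, 1, tzinfo=timezone.utc)
```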
In light of the comments/scrutiny we've received, and the extra checks we did here, I'm much more confident in the veracity of our work. I commit to betting $1K mana on NO at market price over the next 24h to signal this. There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results. The fact that our forecasts are retroactive always opens up the possibility for issues like this, and the gold standard is of course prospective forecasting, but I think we've managed to sanity-check and block sources of error to a reasonable degree.
I'm not sure what standard this market will use to decide whether/how stuff like Platt scaling/scoring operationalizations might count as a "substantive issue," but I'm decently confident that the substance of our work will hold up to historical scrutiny: scaffolded 2024-era language models appear to perform at/above the human crowd level on the Metaculus distribution of forecasting questions, within the bounds of the limitations we have described (such as poor model performance close to resolution). We also look forward to putting a more polished report with additional results on the arxiv at some point.
Long has mentioned that he's happy to answer further questions about the system by email (to long@safe.ai), but we're not expecting to post further clarifications here in the interest of time.
@AdamK I think the whole "your paper fails to replicate" issue is pretty serious RE: 'historical scrutiny', but YMMV
@DavidFWatson Agreed. I look forward to seeing the prospective performance of the system in the coming weeks and months. I can’t speak to Halawi’s claims in particular, but clearly if our system fails to be comparable to the crowd level on new questions (filtered and scored in a manner similar to our evals, assuming the community finds our filtering/scoring choices reasonable), the framing of our claim would have been a mistake.
@AdamK I appreciate you being willing to bet here. If you would like to nominate a trusted mutual party as arbitrator, let me know.
@NathanpmYoung We may have a shared misunderstanding of the magnitude of his offer to bet. I thought he meant he will spend $1k USD, it seems he means 1000 mana, ie, $1 😂, which he has already bet since making this comment.
@DanM yes, I was clearly confused. I was ready to see this market just get absolutely dominated by that bet. Like the "Biden resigns" bet that's just stuck due to one guy's limit order
@AdamK
> There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results
This skepticism seems clearly true to me for ChatGPT-4o-Latest on poe.com. For example, I asked "Where in Nepal was there an earthquake in November 2023?" and it said:
"In November 2023, a significant earthquake struck western Nepal, specifically affecting the Jumla district and surrounding areas. The earthquake, which occurred on November 3, 2023, had a magnitude of 6.4. The tremors were felt across much of western Nepal and parts of neighboring India.
The most affected areas were in the Karnali Province, with districts like Jumla, Dolpa, and Surkhet experiencing notable damage. The earthquake resulted in the loss of lives, injuries, and destruction of property, with aftershocks continuing to cause concern in the days following the main event."
Nov 3 2023, Karnali Province and Jumla are all correct (though the magnitude was 5.7) - the date in particular seems impossible to get without data leakage.
This seems very important: you can't ask it about prediction market questions if it may have been trained on the resolution. I do not think anyone can trust your results without checking for this on the exact 4o checkpoint you used. This may depend on the checkpoint used, and I imagine you used the API not poe.com, so this may not generalise! poe.com also has a "GPT-4o" model which doesn't seem to know as much, so my guess is that it may have leaked in post-training? (I ran out of free credits before I could do much testing)
(Note, if relevant, that I am a NO holder, and think this market is probably overconfident, but it's very important to get this all right!)
@NeelNanda I appreciate your input here. We used gpt-4o 0513. It seems really tricky to determine the extent to which this may have affected retroactive evals, even from looking at the model reasoning traces.
@AdamK A simple test would be to ask it questions like the one I did above, about events in Nov 2023, and see what happens? (You may need some prompt engineering to stop it refusing when it sees Nov 2023 in the prompt, since the system prompt seems to tell it not to answer things post Oct 2023.)