https://www.safe.ai/blog/forecasting
The claim turns out not to be true in some common-sense way. Things that would resolve this YES (draft):
- We look at the data and it turns out that information was leaked to the LLM somehow.
- The questions were selected in a way that chose easy questions.
- The date of the forecast was somehow chosen so as to benefit the LLM.
- This doesn't continue working over the next year of questions (judged against the accuracy of the Metaculus crowd over the last year, i.e. the crowd can't win simply by getting more accurate).
- The AI was just accessing existing forecasts and parroting them.
@NathanpmYoung one hypothesis that I don't see discussed much yet is re: forecasting freshness effects (as distinct from information leakage):
The Metaculus community forecast is a (recency weighted) average of a set of forecasts with varying ages; depending on the question activity level, there can be substantial lag. If the benchmark compared a set of forecasts by the model at time T with the community forecast at time T, assuming no information leakage, the model has an advantage from this alone.
From experience, it's very possible to beat the community forecast with this advantage, particularly if the question isn't "in focus" (e.g. not part of a tournament or top-of-mind as a topic in current events). This is true even with n(forecasters) above the cutoff SafeAI used here (20).
For example, in the ACX tournament, due to the scoring cutoff, there are many questions with hundreds of forecasters who have no real reason to return and update their forecasts.
This hypothesis is consistent with other observations about the system's performance (e.g. that it underperforms on shorter questions where this effect might be less to its advantage)
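To make the lag mechanism concrete, here is a minimal toy sketch (my own illustration; the exponential weighting scheme and the numbers are assumptions, not Metaculus's actual aggregation) of how a recency-weighted average over stale forecasts can trail a forecaster working with up-to-date information at time T:

```python
def recency_weighted_forecast(forecasts, half_life_days=14.0):
    """Toy recency-weighted aggregate: `forecasts` is a list of
    (probability, age_in_days) pairs; older forecasts get exponentially
    less weight. Metaculus's real weighting differs; this only
    illustrates the lag effect."""
    weights = [0.5 ** (age / half_life_days) for _, age in forecasts]
    return sum(w * p for w, (p, _) in zip(weights, forecasts)) / sum(weights)

# A question where new evidence pushed the probability from ~0.3 to ~0.7,
# but most community forecasts predate the news.
stale_crowd = [(0.30, 30), (0.30, 25), (0.35, 20), (0.30, 18), (0.70, 1)]
community = recency_weighted_forecast(stale_crowd)
fresh_model = 0.70  # a model forecasting at time T has seen the news
outcome = 1

brier = lambda p: (p - outcome) ** 2
print(f"community: p={community:.2f}, Brier={brier(community):.3f}")
print(f"fresh:     p={fresh_model:.2f}, Brier={brier(fresh_model):.3f}")
```

The fresh forecast wins here purely because of timing, not skill, which is exactly the advantage described above.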
In order to validate or disprove this hypothesis, one could:
- With Safe AI and Metaculus's support, review the questions that were forecast against and break down performance by the freshness of the community forecast (a rough sketch of this analysis follows this list).
  - Something like: for the subset of questions with at least X% of forecasts made within T time of the cutoff chosen for the benchmark community forecast, do the results still hold?
- Run the experiment again, controlling for community forecast freshness.
  - E.g. constrain the questions chosen to ones where the gain in the number of forecasters over the preceding week is at least X.
- Enter the bot into the Metaculus AI Benchmarking contest, which mostly controls for this (the benchmark forecasts are at most a couple of days stale relative to the bot's forecasts).
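Here's a rough sketch of the breakdown suggested in the first bullet above. The data structure, field names, and thresholds are placeholders; the real analysis would need the benchmark's question-level data from Safe AI and Metaculus:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Question:
    brier_bot: float                 # bot's Brier score on this question
    brier_crowd: float               # community forecast's Brier score
    forecast_ages_days: List[float]  # age of each community forecast at the cutoff

def fresh_fraction(q: Question, window_days: float = 7.0) -> float:
    """Fraction of community forecasts made within `window_days` of the cutoff."""
    return sum(a <= window_days for a in q.forecast_ages_days) / len(q.forecast_ages_days)

def breakdown_by_freshness(questions, threshold=0.5, window_days=7.0):
    """Split questions into 'fresh' vs 'stale' community forecasts and compare
    mean Brier scores for the bot and the crowd within each bucket."""
    buckets = {"fresh": [], "stale": []}
    for q in questions:
        key = "fresh" if fresh_fraction(q, window_days) >= threshold else "stale"
        buckets[key].append(q)
    return {
        key: {
            "n": len(qs),
            "bot_brier": sum(q.brier_bot for q in qs) / len(qs),
            "crowd_brier": sum(q.brier_crowd for q in qs) / len(qs),
        }
        for key, qs in buckets.items() if qs
    }
```

If the bot's edge over the crowd shrinks in the "fresh" bucket, that would support the staleness hypothesis; if it persists, it would count against it.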
@NathanpmYoung The frustrating thing here is that a question like this depends heavily on the judgement of the person making the resolution, and your previous comments suggested that you were leaning pretty heavily toward yes. Has something changed, or did I misinterpret you?
From what I can tell, no one is discussing this anymore, and their demo was taken down pretty quickly. So, I doubt there will be any new evidence. They also seem to have no interest in their own forecaster or in providing evidence, e.g. by entering it into competitions. I think all signs point to their claim being debunked/exaggerated, with most people implicitly not believing it, but no one has cared enough to try to rigorously disprove the claim further than what has been done.
@DanM FWIW I'm confident our claims were correct. I haven't seen any response to our response to Halawi yet, which fully addresses all of the concerns that have been raised imo. The demo was taken down because of budget constraints, since a lot of people were using it. The blog post is still up.
> They also seem to have no interest in their own forecaster or providing evidence e.g., by entering it into competitions.
It was a quick demo, meant to raise awareness that superhuman forecasting is already here by at least one meaningful definition. We don't have bandwidth to run the additional experiments that people have requested, although someone else could run them if they want as our code is public; I'm just replying here because I'm personally interested in clearing things up. I guess I can say that we've worked in this area before (see the Autocast paper from NeurIPS 2022, which was one of the first academic papers on LLM judgmental forecasting), and we followed standard academic practice for retrodictive evaluations, which is also followed by Halawi and others who have done work in this area. This isn't as good as a prospective evaluation, which is a good thing for future work to look into (and which Tetlock's recent paper did indeed study).
I think a lot of opinions expressed here are maybe anchoring on the initial criticisms from before our response. Even the Metaculus and ACX posts that came out after our response was posted were just repeating points that we already addressed, so they probably weren't aware of our response.
@MantasMazeika Do you currently think that your system would perform anywhere close to your advertised performance when entered into something like the Metaculus AI forecasting tournament (which multiple people at Metaculus have offered to enroll your system in)?
That seems like the obvious way to resolve this question, and also to dispel any claims about shoddy work. I currently would take bets at reasonably large odds (4:1) that your system would perform much worse in a prospective Metaculus tournament, and it doesn't seem like that much work to enroll. It will take a few months until we have an answer, but still seems better than nothing.
@OliverHabryka See above for my comments about not having bandwidth to do this. If people want to enroll the system, they are welcome to do so and nothing is stopping them; the full codebase is available on GitHub. As for my thoughts on what the results would be, I wouldn't be surprised if our system performs less well in a tournament with a different question distribution. It performs less well on short-fuse polymarket questions, for instance, as we pointed out in our blog post and our response to Halawi. People who have read our response will know that Halawi's independent evaluation of our system was dominated by short-fuse polymarket questions; when removing those questions from their independent evaluation, their results match ours. This is effectively an independent replication in the retrodictive setting.
ETA: I do think our retrodictive evaluation was performed properly and that there wasn't data contamination, so modulo distribution shifts I would expect our system's performance to reflect how well it performs in a prospective setting, because that is very nearly what proper retrodictive evaluations measure. If it seems surprising that AI systems can do well at judgmental forecasting, then I would say you don't have to just take our word for it; the recent Tetlock paper that I shared below has similar quantitative results, albeit presented differently.
@MantasMazeika Why would your system perform much worse on a different distribution of questions? It seems extremely suspicious if your claims about reaching human-level or superhuman performance on forecasting, a claim that was not marketed as only applying to a narrow-ish distribution of questions, would end up performing much worse on the one testing setup that we have that is actually capable of confidently measuring its performance.
The retrodiction issues are the most concerning issues, so removing questions that resolved quickly from Halawi's dataset gets rid of most of the unbiased data we have. Your mitigations for retrodiction do not seem remotely adequate, though I am glad you tried (I think allowing websearch for retrodictions is a lost cause even with filtering, there is just far too much data leakage).
> when removing those questions from their independent evaluation, their results match ours.
You claim this, do you actually have any writeup or public code-trace for this? This would be at least a mild update for me, but I am not particularly tempted to take you at your word.
@OliverHabryka We clearly describe the limitations of the system in the blog post, including distributions that it performs less well on. It's actually fairly common in ML research that a system may perform at or beyond human-level on one distribution but not solve the task in a fully general sense. We have been clear about this since the release of the demo.
It's odd to describe the distribution we tested on as narrow. You can read for yourself the specific types of questions that we list in the limitations section. Here is the relevant text:
> If something is not in the pretraining distribution and if no articles are written about it, it doesn't know about it. That is, if it's a forecast about something that's only on the X platform, it won't know about it, even if a human could. For forecasts for very soon-to-resolve or recent events, it does poorly. That's because it finished pretraining a while ago so by default thinks Joe Biden is in the race and need to see articles to appreciate the change in facts.
Regarding data contamination, I disagree that the mitigations were inadequate. If one ensures that retrieved articles are from before the cutoff date and takes several overlapping measures to prevent knowledge leakage, then I would think that is a fairly strong set of mitigations. We did these things, and you can read about the specifics in our response to Halawi and the blog post. Again, we have published work in this area at a top ML conference. We know what we are doing.
I'm disappointed that this conversation has devolved into assertions of academic dishonesty, since it signals that it isn't being held in good faith. I don't have all the code for filtering out Halawi's short-fuse polymarket questions and recomputing the metrics. I think some of it might be in Long's folder, and the stuff I have is a bit messy right now. I might dig it up later. In the meantime, you are welcome to download Halawi's questions and run the evaluation yourself using our public code.
> I'm disappointed that this conversation has devolved into assertions of academic dishonesty, since it signals that it isn't being held in good faith.
This whole post is about academic dishonesty! What do you think "substantive issues" means? Yes, I do think your announcements and posts at present appear basically fraudulent to me, and I would defend that accusation. I am definitely not engaging in good faith, in the sense that I am not assuming that you are telling the truth and accurately reporting your intentions (and I think that's fine; I don't see an alternative when I think someone is engaging in academic fraud or otherwise acting dishonestly).
@MantasMazeika
> We did these things, and you can read about the specifics in our response to Halawi and the blog post. Again, we have published work in this area at a top ML conference. We know what we are doing.
Academic fraud is widespread, and publishing in top conferences is not a strong sign of not engaging in fraud. I have probably engaged more closely with your work here than the reviewers of your past papers did. I saw your response to Halawi and found it uncompelling.
To be clear, I am not totally confident here, and I can't rule out that somehow your prompt + system outperformed all other systems that we have data on despite doing nothing that seems to me like it would explain that, but I am highly skeptical. My best guess is you fooled yourself by not being sufficiently paranoid about retrodiction and are now trying to avoid falsification of that.
@OliverHabryka "Substantive issues" doesn't mean academic dishonesty; it usually means a mistake. Also, our system doesn't outperform all other systems, and we never claimed that it did. Probably some of the methods in the Tetlock paper would have similar performance to ours on the data that we evaluated on.
I don't think you are literally engaging in bad faith Oliver, and in the holiday spirit I take that comment back. I do think you have misinterpreted our main claims, though, which I have tried to clarify multiple times. If you actually think we are lying to everyone, then I guess I just don't feel that compelled to respond directly to that comment. I think other people probably disagree and would sooner attribute any error we made to incompetence than to dishonesty. But I think we did things fairly by the book and took good care to ensure our results were valid. Moreover, they are consistent with other similar results in the area and were semi-independently replicated. Finally, all our code is public and anyone can check our claims for themselves. We're not really hiding anything here.
@MantasMazeika I would also like to defend my colleagues in ML and point out that academic dishonesty is far less of an issue in ML than in other fields, largely because there is so much legitimate research to be done. In the dozens of papers I've seen people work on in this field, I don't think I've seen a single instance of academic dishonesty.
@OliverHabryka Also, just a quick point about your statement earlier (tagging @NathanpmYoung since idk if this point was raised before)
> I think allowing websearch for retrodictions is a lost cause even with filtering, there is just far too much data leakage
We are not the only people to do this. Halawi/Steinhardt's paper does exactly the same thing. Their leakage mitigation is described as:
> First, we maintain a whitelist of news websites that publish timestamped articles and only retrieve from the whitelist. Second, our system checks the publish date of each article and discard it if the date is not available or outside the retrieval range.
These mitigations are standard practice in the retrodiction literature. We went even further than this, checking for updated articles and running an LLM over each article to ensure no information was leaked. If you think our mitigations were insufficient, then you would also have to conclude that many results in Halawi's paper are invalid, which I disagree with. I would hope that others also agree that the mitigations deemed sufficient by people who do professional work in this area are in fact sufficient. Certainly prospective evaluations are better, but claiming that by-the-book retrodictive evaluations must all be thrown away is just not a position that anyone in this area holds.
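For readers unfamiliar with what these mitigations look like in practice, here is a minimal sketch of the whitelist-plus-publish-date filter described in the quoted passage. The domain list, date format, and function names are illustrative placeholders, not the actual code from either system, and the additional LLM-based leakage check mentioned above is not shown:

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlparse

# Placeholder whitelist; the real systems maintain their own curated lists.
WHITELISTED_DOMAINS = {"reuters.com", "apnews.com", "bbc.com"}

def keep_article(url: str, publish_date: Optional[str], retrieval_cutoff: datetime) -> bool:
    """Keep an article only if (1) it comes from a whitelisted news site and
    (2) it has a parseable publish date strictly before the retrieval cutoff.
    Articles with missing or unparseable dates are discarded."""
    domain = urlparse(url).netloc.removeprefix("www.")
    if domain not in WHITELISTED_DOMAINS:
        return False
    if publish_date is None:
        return False
    try:
        published = datetime.fromisoformat(publish_date)
    except ValueError:
        return False
    return published < retrieval_cutoff

articles = [
    ("https://www.reuters.com/story-a", "2024-05-01"),
    ("https://www.reuters.com/story-b", "2024-07-15"),    # after cutoff: dropped
    ("https://someblog.example.com/post", "2024-05-01"),  # not whitelisted: dropped
    ("https://www.bbc.com/story-c", None),                # no date: dropped
]
cutoff = datetime(2024, 6, 1)
kept = [(u, d) for u, d in articles if keep_article(u, d, cutoff)]
print(kept)
```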
A recent report from Tetlock shows GPT-4 level models matching crowd median forecasts in a prospective setting: https://arxiv.org/abs/2409.19839. It's unclear how these humans compare to the Metaculus crowd, but it's still a noteworthy result.
@MantasMazeika The models which (almost) match the crowd median use freeze values (they are provided with the crowd forecast). It seems they didn't rank static freeze values; I wonder how that would have compared.
The model used in SafeAI's forecaster seems quite bad according to the leaderboard referenced in that paper: LLM Data Table
SafeAI's model was GPT-4o (scratchpad with news), which performed even worse than GPT-4o (scratchpad)
- Superforecaster median score: 0.093
- Public median score: 0.107
- Claude-3-5-Sonnet-20240620 (scratchpad with freeze values): 0.111
- GPT-4o (scratchpad): 0.128
- GPT-4o (scratchpad with news): 0.134
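For context on what these numbers mean (assuming, as in that paper, that they are Brier scores), here is a quick sketch of how a Brier score is computed over binary questions; lower is better:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and binary outcomes.
    0.0 is perfect; always forecasting 0.5 yields 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative numbers only, not drawn from the paper or the benchmark.
print(brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))  # 0.075
```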
@DanM A few points:
- They did not benchmark our FiveThirtyNine bot. The specific scaffolding used matters a lot, and their scaffolding is the prompt from Halawi et al. This same scaffolding significantly underperformed our scaffolding in our evaluation (0.1221 -> 0.0999 Brier score).
- Claude 3.5 Sonnet w/out freeze values (#12 in the leaderboard) is still within error of the crowd median forecast, although only barely. I agree that it would be interesting to see how the freeze values alone compare.
I haven't had time to dig closely into this study, but I wonder if this is enough to resolve:
https://x.com/metaculus/status/1839069315422646456
@Harlan Note that the Metaculus blog post actually shows that two bots matched the median pro forecasters in their evaluation, which seems to contradict their claim earlier in the post that "AI forecasting does not match ... the performance of experienced forecasters". Regarding how their post bears on our specific evaluation, they state that our "methodology has been highly contested" and link to a LW post by FutureSearch. That post itself doesn't provide new information, but rather repeats and links to several concerns that we fully addressed in our earlier response to Halawi.
Right, okay, question seems to be how to resolve this.
If someone else wants to separately rerun the test, then I think with Halawi's work that would probably be enough for me.
We could enter it into the Metaculus AI forecasting tournament.
I almost think the Platt scaling alone is enough for a yes resolution.
Thoughts? In particular from @AdamK and @DanHendrycks if you want to push back.
@NathanpmYoung Hey, just reaching out to point out our response to Halawi, which seems to not have been linked here yet: https://x.com/justinphan3110/status/1834719817536073992
It turns out their evaluation focused on questions that we explicitly mentioned in our blog post as limitations of the system. When evaluated on the Metaculus subset of their questions, the results actually support our initial claim.
Regarding the phrasing of the question, "substantive issues" strongly suggests that you would think our initial claim is incorrect if you decided that the question should resolve true. It's important to keep in mind what our claim actually is: matching crowd accuracy, subject to certain limitations described in the blog post. I don't think Halawi's post or other points that have been raised support this. Perhaps the question title and resolution criteria could be harmonized to address this.
@DavidFWatson IIRC they were largely repeating points that had already been raised and which we had already addressed (see above).
@MantasMazeika the LessWrong post seems pretty substantive to me, and it appears to be more widely read than your tweet which you claim rebuts it
@DavidFWatson Their LW post talks about some other projects as well. The part referring to our project does the following:
- Links to the Halawi tweet that we fully addressed
- Repeats and links to several points about data contamination that we have fully addressed; see our response to Halawi on these points
The only new point is that they highlight our use of Platt scaling in our technical report, saying, "they only manage to beat the human crowd after applying some post-processing". They then ask, "Maybe a fair criterion for judging 'superhuman performance' could be 'would you also beat the crowd if you applied the same post-processing to the human forecasts?'"
Their point about Platt scaling indicates a misunderstanding of our claims. Our claims were that we matched crowd accuracy, not that we beat the human crowd on Brier score or even that we beat the human crowd in the first place. Whether or not this constitutes superhuman forecasting is something different people will have different opinions about. Personally, I think it does cross a meaningful line that allows us to say the systems are superhuman in a meaningful sense. Coincidentally, as shown in our technical report, we do also match the crowd on Brier score with post-hoc Platt scaling. We don't even use the scaling parameter, so it was really just temperature tuning (in ML parlance), which doesn't affect accuracy. So all in all, there wasn't really new information in their post.
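To illustrate the distinction being drawn here, a small sketch (my own, not the code from the technical report) of Platt scaling and why a monotonic recalibration of the probabilities leaves accuracy at the 0.5 threshold unchanged even though it can move the Brier score:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def platt(p: float, a: float = 1.0, b: float = 0.0) -> float:
    """Platt scaling of a probability: sigmoid(a * logit(p) + b).
    With b = 0 this is pure temperature scaling (a = 1/T), which is
    monotonic and preserves which side of 0.5 each forecast falls on."""
    return sigmoid(a * logit(p) + b)

forecasts = [0.9, 0.35, 0.7, 0.6, 0.2]
outcomes = [1, 0, 1, 0, 0]

def accuracy(ps):
    return sum((p > 0.5) == bool(o) for p, o in zip(ps, outcomes)) / len(ps)

def brier(ps):
    return sum((p - o) ** 2 for p, o in zip(ps, outcomes)) / len(ps)

raw = forecasts
scaled = [platt(p, a=0.6, b=0.0) for p in forecasts]  # temperature-only rescaling

print(f"accuracy raw {accuracy(raw):.2f}  scaled {accuracy(scaled):.2f}")  # identical
print(f"brier    raw {brier(raw):.4f}  scaled {brier(scaled):.4f}")        # can differ
```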
It's unfortunate that our response to Halawi wasn't very widely read. That doesn't bear on the correctness of our points, though.
@DavidFWatson Lol, I think this is the first time someone has called me a name on the internet.
I would like to get Dan's thoughts on this. I don't think they were anticipating that people would double-count their post as new evidence, which obviously isn't on them.