Neural nets will out-perform researchers on peer review by end of 2025.

Richard Ngo predicted this here.

Exact criterion is a WIP. If you ask in the comments I can clarify. I will try to defer to Richard insofar as he offers elaboration.

Tentative criterion: In Jan 2026, ask Richard whether he believes this has been passed, and resolve to his answer. I will ask him about comparison to the median reviewer for a recent ICLR.

See further discussion of resolution in comments e.g. here

I rate this a lot higher than other wildly optimistic AI markets, not because I think AI would be particularly good at it, but because I think humans generally do a very poor job of it on average so it might be a low enough bar.

@AndrewHartman Agreed. It doesn't have to outperform a top researcher who puts in serious effort to evaluate a complicated paper. A lot of barely-worth-reading papers get churned out in academia, receiving minimum-effort reviews that often misunderstand basic claims or even the model.

@i_i Compare to the most recent ICLR where humans did the reviewing.

@AndrewHartman Strongly agree. An AI can spend much more "time" on review and do things like read every cited paper to see if it supports the claim in the main paper. I expect that current AIs are already better than humans at catching the basic errors that can cause issues in papers.

@TonyJackson Also - this is something more of a philosophical stance, but I think that people are heavily overrating what peer review can actually do, practically speaking, in terms of enhancing research quality. While LLMs might only be capable of some very elementary fact checking, I expect that to be more or less the entire value that peer review contributes anyway. If anything, using LLMs might help alleviate the problems small disciplines have with conflicts of interest.

That said, I haven't actually seen much evidence that LLMs are up to even the basic stuff yet, but let's say it seems a lot more plausible than e.g. an LLM overtaking Kubrick as a cinematographer.

@AndrewHartman Yes TBC this question is about the average reviewer; the average is dragged down by 'reviewers' who are absolutely not trying and/or did not even read the paper. After a quick glance through ICLR '24 rejected paper reviews, seems like >10% of reviews make no substantial comment on any content of the paper (i.e. could've been made without reading more than 1 page).

@JacobPfau Yeah, I feel that pain for sure. While the really nitpicky reviewers could be irritating, the ones who skimmed your paper and made a criticism that's already addressed in the paper were worse. Though you could at least just tell them you fixed it and then resubmit and they'd check it off (probably without even reading it again).

@JacobPfau one of the biggest challenges is that even the low-quality human reviewers will usually catch some egregious technical flaw, but you don't see this in many reviews because most papers, even terrible ones, don't have such flaws. It seems very difficult to have that kind of faith in anything like today's LLMs, and that's a big performance issue even if most LLM peer review text looks very similar to or better than humans'.

@Jacy Sure, interesting point. TBC, I'll try to evaluate on the same distribution of papers that ICLR's human reviewers receive, so I'm unsure how important such a failing would be.

@JacobPfau to outperform humans, I think you would need a specific test that includes well-written papers with technical flaws that the LLM or other ML system catches at better rates than humans. Those cases are where a lot of the value of peer review comes from, even if they were a tiny portion of submissions.

In other words, I took this question to be about total performance rather than matching reviews for the median paper. By analogy, you could have an AI that beats human performance on an overwhelming majority of air security bomb detection tasks (e.g., the AI just always says "no bomb" or like an LLM it provides better-than-human discussion of why certain liquids or powders are not bombs), but if the humans perform much better on the 0.0...01% of cases when there really is a bomb, then clearly the AI is not outperforming.

The performance weights of peer review aren't that disproportionate, but I'd argue most ML researchers still see them as concentrated in a small number of cases.
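To make the analogy concrete, here is a minimal sketch with made-up numbers. The base rate, the false-alarm rate, and the human detection rate are all assumptions chosen for illustration; none of them come from this thread or from any real screening data. It shows how a system can win on plain accuracy while catching none of the rare cases that matter.

```python
# Hypothetical numbers, chosen only to illustrate the analogy above.

def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Fraction of the real positives (bombs) that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Assumed: 1,000,000 bags, 10 of which contain a bomb.
total, bombs = 1_000_000, 10
clean = total - bombs

# "AI" that always answers "no bomb": no false alarms, no detections.
ai = dict(tp=0, tn=clean, fp=0, fn=bombs)

# "Human" screeners: assumed ~1% false-alarm rate, catch 9 of the 10 bombs.
human_fp = clean // 100
human = dict(tp=9, tn=clean - human_fp, fp=human_fp, fn=1)

ai_acc = accuracy(**ai)
human_acc = accuracy(**human)
ai_rec = recall(ai["tp"], ai["fn"])
human_rec = recall(human["tp"], human["fn"])

print(f"accuracy: AI={ai_acc:.5f} human={human_acc:.5f}")
print(f"recall:   AI={ai_rec:.2f} human={human_rec:.2f}")
```

Under these assumed numbers the always-"no bomb" system has the higher accuracy but zero recall, which is the sense in which it is "clearly not outperforming".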

@Jacy I agree that the question you describe here is probably the more important one, but I do not believe it is the most straightforward reading of my question text or of Ngo's linked statement (in particular, your version would be a difficult question to resolve).

Ngo's post says "Do better than most peer reviewers"

Question text says "Will ask him about comparison to the median reviewer for a recent ICLR."

In both cases, it seems to me relatively clear that the point is to compare to peer reviewers in the wild on questions posed in the wild.

If I had, for instance, asked "Will I/Ngo believe LMs would improve the peer review process on net?" then your reading would be appropriate. However, I did not ask that question here.

As the question text states, I will at resolution time ask Richard and if it turns out to be the case that your reading was closer to his intended version of the claim then I will accede to that. If he does not respond, then we will go with my best attempt at inferring intention. I am happy to elaborate on how I understand the question here if you want to ask further questions.

@Jacy I had forgotten about this tweet: "We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews." The first part speaks more to my comment above, and the second part speaks to your version. I had previously linked to this below.

If these tests are both carried out and they end up giving opposing results, I think the best option would be to resolve this question partially to 50%. Can clarify further as needed.
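As a rough sketch of that rule (not a binding resolution procedure): the function name `resolve` and its two arguments are hypothetical. The only case taken from the comment above is the 50% resolution when the two tests disagree. Treating agreement as a full YES/NO, and returning `None` when either test was not run, are assumptions, since the final call still defers to Richard.

```python
from typing import Optional

def resolve(helpfulness_favors_ai: Optional[bool],
            flaw_test_favors_ai: Optional[bool]) -> Optional[float]:
    """Return a resolution probability, or None if this sketch does not apply.

    helpfulness_favors_ai: outcome of the area-chair helpfulness rating test.
    flaw_test_favors_ai:   outcome of the "find the big known flaw" test.
    None for either argument means that test was not carried out.
    """
    if helpfulness_favors_ai is None or flaw_test_favors_ai is None:
        # Not covered by the comment above; fall back to asking Richard.
        return None
    if helpfulness_favors_ai != flaw_test_favors_ai:
        # Opposing results: resolve partially to 50%.
        return 0.5
    # Assumption: agreeing results resolve fully in that direction.
    return 1.0 if helpfulness_favors_ai else 0.0

print(resolve(True, False))
```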

@JacobPfau In case the link doesn't work in the future:

This was a little tongue-in-cheek, but I definitely don't think it's unfalsifiable. We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews.

What if an LLM found more errors in papers than peer reviewers did?

@NathanpmYoung That would count in favor of the LM, but other factors would also be considered.

Does it need to be a single neural net, or can it be an AI that's built from several neural nets working together? See

@YoavTzfati Unless someone voices a compelling reason against, I'd somewhat arbitrarily say a single NN (which may be called in various ways). Things like speculative sampling still count as a single NN.

@JacobPfau Not sure I got it 100%. If there's a structured, automated process consisting of 10 calls to the same NN, does that count?

@YoavTzfati That counts.

For the record, my object-level prediction on this is ~24% (as measured by, e.g., a survey of ICLR authors comparing the quality of human peer reviews with reviews from the best AI model), though my prediction for how Jacob, Richard, etc. interpret it at the end of 2025 is substantially higher.

Most of that ~24% is hard takeoff (i.e., an AI that has progressed so quickly that this question is pretty uninteresting and unimportant).

Any restrictions on the academic field? Or does this require that models can comprehend papers in every area of maths?

@WaddleBuddy Good question. Since I am (and presumably Richard is) only familiar with ML, let's go with ICLR's median reviewer, as publicly estimable here.

Evaluation should be done on papers which are not in the model's training set (in case it saw reviews).