Neural nets will out-perform researchers on peer review by end of 2025.
19% chance

Richard Ngo predicted this here.

The exact criterion is a work in progress. If you ask in the comments, I can clarify. I will try to defer to Richard insofar as he offers elaboration.

Tentative criterion: in January 2026, ask Richard whether he believes this has been passed, and resolve to his answer. I will ask him to compare against the median reviewer for a recent ICLR.

I rate this a lot higher than other wildly optimistic AI markets, not because I think AI would be particularly good at it, but because humans generally do a very poor job of it on average, so it might be a low enough bar.

bought Ṁ20 of YES

@JacobPfau In case the link doesn't work in the future:

This was a little tongue-in-cheek, but I definitely don't think it's unfalsifiable. We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews.
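
A minimal sketch of how that first test could be scored, assuming hypothetical helpfulness ratings from area chairs (all names and numbers below are illustrative, not from the market):

```python
# Hypothetical sketch of the proposed evaluation: area chairs rate
# anonymised reviews (some AI-written, some human-written) for helpfulness,
# and we compare mean ratings per source. All data below is illustrative.
from statistics import mean

# (source, helpfulness rating on a 1-5 scale) as an area chair might assign
ratings = [
    ("ai", 4), ("human", 3), ("human", 4),
    ("ai", 2), ("human", 5), ("ai", 3),
]

ai_scores = [r for src, r in ratings if src == "ai"]
human_scores = [r for src, r in ratings if src == "human"]

print(f"AI mean helpfulness:    {mean(ai_scores):.2f}")
print(f"Human mean helpfulness: {mean(human_scores):.2f}")
```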

What if an LLM found more errors in papers than peer reviewers did?

@NathanpmYoung That would count in favor of the LLM, but other factors would also be considered: https://twitter.com/richardmcngo/status/1640926536705671168?s=46&t=gHwoO3eGDc6sgu1SSV-32w

Does it need to be a single neural net, or can it be an AI that's built from several neural nets working together? See https://ought.org/elicit

@YoavTzfati Unless someone voices a compelling reason against, I'd somewhat arbitrarily say a single NN (which may be invoked in various ways). Things like speculative sampling still count as a single NN.

predicts YES

@JacobPfau Not sure I got it 100%. If there's a structured, automated process consisting of 10 calls to the same NN, does that count?

@YoavTzfati That counts.
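
To illustrate the distinction, here is a minimal sketch of a structured, automated process that makes several calls to the same network, which would count under the criterion above; `query_model` is a hypothetical stand-in for the single NN's inference API:

```python
# Hypothetical sketch of a "structured, automated process consisting of
# N calls to the same NN" that still counts as a single neural net.
# `query_model` is a placeholder for one fixed model's inference API.

def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call the single NN's API.
    return f"<model output for: {prompt[:40]}...>"

def review_paper(paper_text: str, sections: list[str]) -> str:
    # One call per section, all routed to the same underlying network.
    section_notes = [
        query_model(f"Critique the {name} section:\n{paper_text}")
        for name in sections
    ]
    # A final call to the same network aggregates the notes into a review.
    return query_model("Combine these notes into one review:\n" + "\n".join(section_notes))

print(review_paper("...", ["methods", "experiments", "related work"]))
```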

bought Ṁ50 of NO

For the record, my object-level prediction on this is ~24% (e.g., survey results of ICLR authors about the quality of peer reviews by humans versus peer reviews by the best AI model), though my prediction for how Jacob, Richard, etc. will interpret it at the end of 2025 is substantially higher.

Most of that ~24% is hard takeoff (i.e., an AI that has progressed so quickly that this question is pretty uninteresting and unimportant).

Any restrictions on the academic field? Or does this require that models can comprehend papers in every area of maths?

@WaddleBuddy Good question. Since I (and presumably Richard) am only familiar with ML, let's go with ICLR's median reviewer, as publicly estimable here.

Evaluation should be done on papers that are not in the model's training set (in case it saw their reviews).
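
A minimal sketch of that contamination check, assuming a hypothetical training cutoff date and illustrative paper metadata:

```python
# Hypothetical sketch: restrict evaluation to papers the model cannot have
# seen (along with their reviews) during training, using a date cutoff.
from datetime import date

TRAINING_CUTOFF = date(2023, 9, 1)  # illustrative cutoff, not a real model's

papers = [
    {"id": "A", "first_public": date(2023, 5, 1)},
    {"id": "B", "first_public": date(2024, 2, 10)},
]

# Keep only papers first made public after the model's training cutoff.
eval_set = [p for p in papers if p["first_public"] > TRAINING_CUTOFF]
print([p["id"] for p in eval_set])  # -> ['B']
```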