Richard Ngo predicted this here.
The exact criterion is a work in progress; if you ask in the comments, I can clarify. I will try to defer to Richard insofar as he offers elaboration.
Tentative criterion: in Jan 2026, ask Richard whether he believes this has been passed, and resolve to his answer. I will ask him to compare against the median reviewer for a recent ICLR.
Related questions
There is some more detail on how the comparison can be done here: https://twitter.com/richardmcngo/status/1640926536705671168?s=46&t=gHwoO3eGDc6sgu1SSV-32w
@JacobPfau In case the link doesn't work in the future:
This was a little tongue-in-cheek, but I definitely don't think it's unfalsifiable. We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews.
@NathanpmYoung That would count in favor of the LM but other factors would also be considered https://twitter.com/richardmcngo/status/1640926536705671168?s=46&t=gHwoO3eGDc6sgu1SSV-32w
Does it need to be a single neural net, or can it be an AI that's built from several neural nets working together? See https://ought.org/elicit
@YoavTzfati Unless someone voices a compelling reason against, I'd somewhat arbitrarily say a single NN (which may be called in various ways). Things like speculative sampling still count as a single NN.
@JacobPfau Not sure I got it 100%. If there's a structured, automated process consisting of 10 calls to the same NN, does that count?
For the record, my object-level prediction on this is ~24% (e.g., survey results of ICLR authors about the quality of peer reviews of humans versus peer reviews of the best AI model), though my prediction for how Jacob, Richard, etc. interpret it at the end of 2025 is substantially higher.
Most of that ~24% is hard takeoff (i.e., an AI that has progressed so quickly that this question is pretty uninteresting and unimportant).
@WaddleBuddy Good question. Since I (and presumably Richard) am only familiar with ML, let's go with ICLR's median reviewer, as publicly estimable here.
Evaluation should be done on papers that are not in the model's training set (in case it has seen their reviews).
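One simple way to operationalize this (a hedged sketch, not part of the resolution criterion; the cutoff date and the paper-metadata format are illustrative assumptions) is to restrict evaluation to papers first made public after the model's training-data cutoff:

```python
from datetime import date

# Hypothetical training-data cutoff for the model under evaluation.
TRAINING_CUTOFF = date(2025, 6, 1)

def eligible_papers(papers, cutoff=TRAINING_CUTOFF):
    """Keep only papers first made public after the model's training cutoff,
    so the model cannot have seen the paper or any reviews of it."""
    return [p for p in papers if p["first_public"] > cutoff]

# Illustrative metadata; real evaluation would pull dates from the venue.
papers = [
    {"id": "paper-a", "first_public": date(2025, 3, 1)},   # before cutoff: excluded
    {"id": "paper-b", "first_public": date(2025, 9, 15)},  # after cutoff: included
]
print([p["id"] for p in eligible_papers(papers)])  # ['paper-b']
```

This only rules out direct memorization; it does not address subtler leakage (e.g., preprints of revised versions), which the resolver would still need to judge.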