Richard Ngo predicted this here.
Resolution criterion: I will ask Richard whether he believes this has been passed as of Jan 2026 and resolve to his answer. I will ask him about the comparison to the median reviewer for a recent ICLR. The question posed to Richard will address two capabilities separately: Turing-test-style anonymized preference judgment and the capacity to spot important errors in papers (described here); the market resolves to 50% if his beliefs differ on these two capabilities. If Richard declines to respond, I will resolve to a percentage reflecting my credence. If there is material disagreement (a 20-80% market value at the time of resolution), I will divest before resolving to avoid a conflict of interest.
If you ask in the comments, I can clarify any aspect of these resolution criteria.
Certainly that's strong evidence, thanks! But as mentioned in the comments below, we'll wait until models have been shown to outperform both on human preferences and on mistake catch rate (resolving to 50% if only one holds, per Richard). The latter isn't shown in the paper.
(edited to make absolutely clear this comment is a personal credence update only)
@JacobPfau surely you're not implying that a paper like this would justify a YES resolution if it stated outperformance on mistake catch rate. I don't and won't have time to read the surely many papers that will come out with such claims (particularly about reviews submitted and evaluated through their own opaque, idiosyncratic platform), but there's a massive leap from an LLM scoring higher in something like their Table 1 to saying anything like an LLM outperforms "the median reviewer for a recent ICLR."
While it's not the explicit criterion, it also seems clear that if LLMs were to get this good, you would see significant displacement of human peer review at real, notable conferences. Everyone agrees peer reviewing (at least in ML) sucks these days, and given what a massive time sink it is for researchers, an automated solution would be very welcome. (Probably it would still have human involvement, but that could be greatly reduced.)
I find this particularly concerning in light of you having just become the second largest YES holder in what will inevitably be a highly subjective market. [Edit: This also seems to be your largest position in any Manifold market, which is very dubious.]
As I've repeatedly said, the resolution will be determined, if at all possible, by Richard Ngo. I do not bet on my own markets when the resolution hinges on my personal judgment. If the market remains high-entropy (say 20-80%) and Richard doesn't respond, I will divest before resolving.
I find you somewhat hostile, so I may not engage further with you here, but I always welcome requests for clarification and if you ask for a specific clarification I am happy to address any resolution criterion further :)
@JacobPfau Sorry for seeming hostile. Just to be clear, this is me talking as a trader with a large position in what I think could be a really great Manifold market, not as a moderator.
Personally, I think the choices you're making here are quite bad, bad in the specific ways that most limit the potential of Manifold and that I'd hoped we mostly got rid of (e.g., ways that can mislead and bait new, well-intentioned users into bad bets), but I don't hold that against you as a person. I'm confident you're approaching this with good intentions, and I've hesitated to add more commentary that requires your time and attention. The implication that the preprint cited above is even close to justifying a YES or 50% resolution was just egregious enough to meet that bar for me, but I've said my piece. Thanks for reading.
More lampooning.
This question is clear cut and has been since day 1: the resolution criterion is to ask Richard Ngo about the accuracy of his tweeted prediction on automated peer review. I've moved comment-thread discussion regarding further details on resolution into the question to make this immediately clear to new predictors.
I will continue to keep my credences accountable by betting on them and writing my thoughts as comments.
There have been a lot of LLM-generated reviews showing up at computer science conferences lately. It's estimated that between 6.5% and 16.9% of sentences in reviews at last year's top-tier peer-reviewed NLP and ML venues were LLM-modified, and that number is surely larger this year, not to mention the would-be reviewers who go out of their way to hide the fact that their reviews were LLM-generated.
And what has the verdict been on the quality of these reviews? Very strongly negative. While they vaguely pattern-match onto quality reviews in structure and diction, and they clearly excel at things like length and being free of typos, they are typically word salad or at best say generic things like, "The paper only tests their method on one dataset," usually drawn from the authors' own limitations section in the submitted paper. They rarely point out specific issues (e.g., the sentence on line 140 is inconsistent with the sentence on line 130; you missed a paper that did similar work two years ago in a different field) or make any points with original insight.
This could always change, and these would-be reviewers probably aren't spending much time optimizing the LLM output (just using bots like the Reviewer 2 GPT), but there is clearly a long way to go.
@AndrewHartman Agreed. It doesn't have to outperform a top researcher who puts in serious effort to evaluate a complicated paper. A lot of barely-worth-reading papers get churned out in academia, receiving minimum-effort reviews that often misunderstand basic claims or even the model.
@VitorBosshard yeah + how's this resolve if they give up on human review entirely
https://www.wsj.com/science/academic-studies-research-paper-mills-journals-publishing-f5a3d4bc
@AndrewHartman Strongly agree. An AI can spend much more "time" on review and do things like read every cited paper to see if it supports the claim in the main paper. I expect that current AIs are already better than humans at catching the basic errors that can cause issues in papers.
@TonyJackson Also, this is more of a philosophical stance, but I think people are heavily overrating what peer review can actually do, practically speaking, to enhance research quality. While LLMs might only be capable of some very elementary fact checking, I expect that to be more or less the entire value peer review contributes anyway. If anything, using LLMs might help alleviate the problems small disciplines have with conflicts of interest.
That said, I haven't actually seen much evidence that LLMs are up to even the basic stuff yet, but let's say it seems a lot more plausible than e.g. an LLM overtaking Kubrick as a cinematographer.
@AndrewHartman Yes, TBC this question is about the average reviewer; the average is dragged down by 'reviewers' who are absolutely not trying and/or did not even read the paper. After a quick glance through the ICLR '24 rejected-paper reviews, it seems like >10% of reviews make no substantial comment on any content of the paper (i.e., they could have been written without reading more than one page).
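For anyone who wants to sanity-check that spot check themselves, here's a rough sketch of the kind of crude screen I have in mind. It assumes the review texts have already been downloaded (e.g. from OpenReview) into plain strings; the word threshold, phrase list, and marker words are arbitrary illustrations of mine, not the exact procedure behind the >10% figure.

```python
# Rough sketch only: flag reviews that could plausibly have been written
# without reading past page 1. Assumes reviews are already downloaded as
# plain-text strings; thresholds and phrases are arbitrary illustrations.

GENERIC_PHRASES = [
    "more experiments",           # asks for more results without naming any
    "more datasets",
    "writing could be improved",
    "limited novelty",
    "not well motivated",
]

def looks_low_effort(review: str, min_words: int = 150) -> bool:
    """Heuristic: short review, or one whose complaints are all generic boilerplate."""
    if len(review.split()) < min_words:
        return True
    text = review.lower()
    # References to concrete parts of the paper suggest the reviewer read it.
    specific_markers = ("equation", "theorem", "figure", "table", "section", "line")
    has_specific_reference = any(marker in text for marker in specific_markers)
    is_mostly_generic = sum(phrase in text for phrase in GENERIC_PHRASES) >= 2
    return is_mostly_generic and not has_specific_reference

def low_effort_fraction(reviews: list[str]) -> float:
    """Fraction of reviews flagged by the heuristic above."""
    if not reviews:
        return 0.0
    return sum(looks_low_effort(r) for r in reviews) / len(reviews)
```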
@JacobPfau Yeah, I feel that pain for sure. While the really nitpicky reviewers could be irritating, the ones who skimmed your paper and made a criticism that's already addressed in the paper were worse. Though you could at least just tell them you fixed it and then resubmit and they'd check it off (probably without even reading it again).
@JacobPfau one of the biggest challenges is that even low-quality human reviewers will usually catch an egregious technical flaw when one is present, but you don't see this reflected in many reviews because most papers, even terrible ones, don't have such flaws. It seems very difficult to have that kind of faith in anything like today's LLMs, and that's a big performance issue even if most LLM peer-review text looks very similar to, or better than, humans'.
@Jacy Sure, interesting point. TBC I'll try to evaluate on the same distribution of papers that ICLR reviewers receive, though, so I'm unsure how important such a failing would be.
@JacobPfau to outperform humans, I think you would need a specific test that includes well-written papers with technical flaws, which the LLM or other ML system catches at a higher rate than humans do. Those cases are where a lot of the value of peer review comes from, even if they are a tiny portion of submissions.
In other words, I took this question to be about total performance rather than matching reviews for the median paper. By analogy, you could have an AI that beats human performance on an overwhelming majority of air-security bomb-detection tasks (e.g., the AI just always says "no bomb," or, like an LLM, it provides better-than-human discussion of why certain liquids or powders are not bombs), but if the humans perform much better on the 0.0...01% of cases when there really is a bomb, then clearly the AI is not outperforming.
The value of peer review isn't weighted quite that disproportionately, but I'd argue most ML researchers still see it as concentrated in a small number of cases.
@Jacy I agree that the question you describe here is probably the more important one, but I do not believe that's the most straightforward reading of my question text or of Ngo's linked statement (in particular, your version would be a difficult question to resolve).
Ngo's post says "Do better than most peer reviewers"
Question text says "Will ask him about comparison to the median reviewer for a recent ICLR."
In both cases, it seems to me relatively clear that the point is to compare to peer reviewers in the wild on questions posed in the wild.
If I had, for instance, asked "Will I/Ngo believe LMs would improve the peer review process on net?" then your reading would be appropriate. However, that is not the question I asked here.
As the question text states, I will ask Richard at resolution time, and if it turns out that your reading is closer to his intended version of the claim, I will defer to that. If he does not respond, we will go with my best attempt at inferring his intent. I am happy to elaborate on how I understand the question here if you want to ask further questions.
@Jacy I had forgotten about this tweet: "We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews." The first part speaks more to my comment above, and the second part speaks to your version. I had previously linked to this below: https://manifold.markets/JacobPfau/neural-nets-will-outperform-researc#yZU120T4AvxV1K8WUxjj
If both tests are carried out and they give opposing results, I think the best option would be to resolve this question to 50%. I can clarify further as needed; a rough sketch of how I'd apply that rule is below.
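To make that two-pronged rule concrete, here is a minimal, non-binding sketch of the decision logic as I currently picture it. The function name, inputs, and thresholds are mine for illustration only; Richard's judgment at resolution time supersedes anything here.

```python
# Toy sketch of the intended resolution logic, not a binding procedure.
# Inputs are hypothetical summary statistics from the two kinds of tests
# Richard describes: (1) anonymized helpfulness ratings by area chairs,
# (2) how often a big known flaw is identified with AI vs. human reviews.

def resolution_value(ai_preferred_rate: float,
                     ai_flaw_catch_rate: float,
                     human_flaw_catch_rate: float) -> float:
    """Return the market resolution probability under this toy rule.

    ai_preferred_rate: fraction of blinded comparisons where area chairs
        rate the AI review as more helpful than the human review.
    ai_flaw_catch_rate / human_flaw_catch_rate: rates at which known,
        important flaws are identified with AI vs. human reviews.
    """
    wins_on_preference = ai_preferred_rate > 0.5
    wins_on_flaws = ai_flaw_catch_rate > human_flaw_catch_rate

    if wins_on_preference and wins_on_flaws:
        return 1.0   # YES
    if wins_on_preference or wins_on_flaws:
        return 0.5   # the two tests disagree -> resolve to 50%
    return 0.0       # NO

# Example: AI reviews preferred 60% of the time but worse at catching flaws.
print(resolution_value(0.60, ai_flaw_catch_rate=0.30, human_flaw_catch_rate=0.45))  # 0.5
```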
Some more detail on how the comparison can be done here: https://twitter.com/richardmcngo/status/1640926536705671168?s=46&t=gHwoO3eGDc6sgu1SSV-32w
@JacobPfau In case the link doesn't work in the future:
This was a little tongue-in-cheek, but I definitely don't think it's unfalsifiable. We could e.g. get area chairs to read 3 anonymised reviews and rate them by helpfulness; or test whether a human is better able to identify a big known flaw given access to AI vs human reviews.