This paper was recently posted to arXiv: https://arxiv.org/abs/2503.21934
It claims that SOTA LLMs scored surprisingly low on this year's USAMO, averaging less than 5%.
In some discussions of this paper I've seen AI defenders claim that the paper is fake.
This weekend I will look into the paper's methodology and, if things are still unclear, try to recreate the results with the models I have access to (DeepSeek, o3-mini, Claude 3.7 thinking). I will then determine whether I think the paper's results are substantially true.
Possible resolutions are:
100%, the paper's results seem basically correct.
80%, the main thrust is correct but it seems like models performed particularly badly in their tests or they graded unnecessarily harshly.
50%, I end up more confused than I am now and don't form an internal consensus.
20%, the paper's results are substantially, but not wholly, incorrect in my view.
0%, this seems like a fake paper to me or is completely wrong.
My credentials and current epistemic status:
Former USAMO competitor and current PhD student in math.
Significant AI skeptic compared to most of Manifold, but probably not compared to the general population.
The results of this paper were surprising to me; I would have expected much better performance.
Since the resolution criteria are subjective, I will not trade in this market.