PaperBench is a benchmark open-sourced by OpenAI that evaluates the ability of AI agents to replicate state-of-the-art AI research papers from scratch. The papers are sourced from the ICML 2024 Spotlight and Oral tracks.
This market concerns the PaperBench Code-Dev variant, which specifically measures an agent's ability to write the code required for replication, as judged against the benchmark's grading rubrics.
The current State of the Art (SotA) reported in the paper for the Code-Dev variant is 43.4% (achieved by o1-high).

Figure 1 (above) from the paper illustrates the overall benchmark idea.
Why use the PaperBench Code-Dev metric?
The PaperBench Code-Dev variant is simpler and cheaper to evaluate than the full benchmark: it uses an LLM-based judge to apply the benchmark's scoring rubric, but it does not execute the submitted reproduction code itself. This increases variance but obviates the need for GPU-equipped VMs during evaluation.
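To make the rubric-based scoring concrete, below is a minimal sketch of how a weighted, hierarchical rubric might be aggregated into a single Replication Score once an LLM judge has graded the leaf requirements. The `RubricNode` class, the weights, and the example requirements are illustrative assumptions for this market description, not the official PaperBench judge implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """Illustrative rubric node (assumed structure): leaves hold a
    judge-assigned pass/fail, internal nodes aggregate children by weight."""
    name: str
    weight: float = 1.0
    passed: bool | None = None          # set by the LLM judge on leaf nodes
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        # Leaf: 1.0 if the judge marked the requirement satisfied, else 0.0.
        if not self.children:
            return 1.0 if self.passed else 0.0
        # Internal node: weighted average of child scores.
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Example: a tiny two-requirement rubric for one hypothetical paper.
rubric = RubricNode("paper-x", children=[
    RubricNode("implement training loop", weight=2.0, passed=True),
    RubricNode("implement ablation script", weight=1.0, passed=False),
])
print(f"Replication Score: {rubric.score():.1%}")  # -> 66.7%
```

In the actual benchmark, each paper has its own rubric, and the judge grades the leaf requirements from the submitted code.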
Market Details
Source: This market resolves based on published data from the maintainers of the PaperBench benchmark (e.g., OpenAI Evals Team or designated successors) or credible third-party evaluations using the official benchmark configuration.
Metric: Average Replication Score (%) on the PaperBench Code-Dev variant, averaged across the official set of benchmark papers (see the sketch at the end of this section).
Threshold Score: 75.0% Average Replication Score or greater.
Resolution Criterion: This market resolves to YES if the State-of-the-Art (SotA) Average Replication Score on PaperBench Code-Dev is credibly reported to have reached or surpassed 75.0% by 11:59 PM UTC on December 31, 2025. Otherwise, the market resolves to NO.
Market Closing Date: The market will close on January 15, 2026, to allow for potential reporting delays. It will resolve earlier if the YES condition (a reported score of 75.0% or greater) is met and confirmed before this date.
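As a final illustration of how the metric and the threshold interact, here is a minimal sketch. The per-paper scores are made up; the only assumption is that the headline number is a simple mean of per-paper Replication Scores.

```python
# Hypothetical per-paper Code-Dev Replication Scores (fractions in [0, 1]).
per_paper_scores = [0.81, 0.74, 0.69, 0.90]

average_replication_score = sum(per_paper_scores) / len(per_paper_scores)
print(f"Average Replication Score: {average_replication_score:.1%}")  # 78.5%

# The market resolves YES only if the reported SotA average is 75.0% or more.
print("Resolves YES" if average_replication_score >= 0.75 else "Resolves NO")
```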