What will be the best score on the GPQA benchmark before 2025?

This question will resolve to the state-of-the-art accuracy achieved by an AI system on the GPQA (Diamond) benchmark, including any post-training enhancements but excluding any human assistance. Resolution will be based on credible, publicly available results published before January 1st, 2025. Credible sources include, but are not limited to, blog posts, arXiv preprints, and papers.

Background information:

From the GPQA paper (Rein et al.):

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof").

As of March 15th, 2024, the best system is based on Claude 3 Opus (Maj@32, 5-shot CoT), achieving 59.5%.
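
For readers unfamiliar with the notation, Maj@32 is generally understood to mean sampling 32 answers per question and scoring the most common one. The sketch below illustrates that scoring under stated assumptions: the function names are hypothetical, and the tie-breaking rule (first answer seen wins) is an assumption, since the reported result does not specify one here.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the samples.

    Assumption: ties go to the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]

def maj_at_k_accuracy(sampled_answers, gold):
    """Fraction of questions whose majority-voted answer matches the gold label.

    sampled_answers: one list of k sampled answers per question
    gold: the correct answer for each question, in the same order
    """
    correct = sum(majority_vote(s) == g for s, g in zip(sampled_answers, gold))
    return correct / len(gold)
```

With k = 32, each inner list would hold 32 sampled multiple-choice letters for one question.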

Part of the AI Benchmarks series by the AI Safety Student Team at Harvard on evaluations of AI models against technical benchmarks. Full list of questions:
