From the abstract:
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy.
This question resolves YES if a credible paper, blog post, or other document indicates that at least one AI system scored greater than 74.0% on the GPQA dataset before January 1, 2027, and NO otherwise. The result must be credible, and I will exclude results that appear to involve cheating, such as those obtained by training on the test set.
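As a rough illustration of the criteria above, the sketch below encodes the strict "greater than 74.0%" threshold and the date cutoff, and works out how many correct answers that threshold implies. The function names are hypothetical. It assumes the cutoff applies to the date the result was reported, and that the score is computed over all 448 questions mentioned in the abstract. Credibility and cheating are judgment calls, so they are passed in as plain booleans.

```python
from datetime import date

THRESHOLD_PCT = 74.0          # strict threshold from the resolution text
DEADLINE = date(2027, 1, 1)   # results must be dated before this day


def resolves_yes(score_pct: float, reported_on: date,
                 credible: bool, appears_to_cheat: bool) -> bool:
    """Return True if one reported result would satisfy the stated criteria.

    Assumption: `reported_on` is the date the result was published.
    """
    return (
        score_pct > THRESHOLD_PCT      # strictly greater than 74.0%
        and reported_on < DEADLINE     # strictly before January 1, 2027
        and credible
        and not appears_to_cheat
    )


def min_correct_needed(n_questions: int) -> int:
    """Smallest number of correct answers whose accuracy exceeds 74%.

    Uses integer arithmetic: we need k > 74 * n / 100.
    """
    return (74 * n_questions) // 100 + 1


if __name__ == "__main__":
    # Assumption: evaluation on the full 448-question set from the abstract.
    print(min_correct_needed(448))
    print(resolves_yes(74.0, date(2026, 6, 1), True, False))
```

Under these assumptions, a score of exactly 74.0% does not qualify, and a result evaluated on a different number of questions would need a correspondingly different count of correct answers.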