Is GPT-4's human exam performance mainly due to memorisation?
resolved Jul 21

Resolves positive if convincing analysis comes out that GPT-4 is not much better than GPT-3.5 on human exams that are known to not appear in the training data, and negative if convincing analysis comes out showing the opposite.

predicted NO

Finally, a resolution source

“To ensure an unbiased evaluation, we carefully curate questions that are not readily accessible online and couldn’t be easily extracted and transformed into text”

“GPT4 surpassed GPT3.5 by a significant margin across all seven experimental settings”

predicted NO

This evaluation shows an 11-percentage-point advantage for GPT-4, but does not put contamination beyond doubt.

This evaluation is unlikely to be contaminated and shows an advantage for GPT-4, even if it’s not a strong result overall (0pts to 4pts). Whether it’s “much” better is debatable.

Holding off resolution for now

predicted NO

I tested GPT-3.5 on the multiple-choice questions here, and it got 11/20 vs 17/20 for GPT-4.

I’ll try to find 1 more comparison before I resolve
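
For anyone who wants to sanity-check how much weight 11/20 vs 17/20 can bear, here is a minimal sketch of a one-sided Fisher exact test using only the Python standard library. The function and variable names are my own, and it assumes the two scores can be treated as independent samples of 20 questions each. In reality both models saw the same questions, so a paired test would be more appropriate, but that needs per-question results, which aren't given here.

```python
from math import comb


def fisher_one_sided(correct_a: int, correct_b: int, n: int) -> float:
    """One-sided Fisher exact test for two models scored on n questions each.

    Returns the probability that model B gets at least `correct_b` right,
    given the combined number of correct answers, if both models were
    equally accurate (hypergeometric null).
    """
    total_correct = correct_a + correct_b
    total_wrong = 2 * n - total_correct
    denom = comb(2 * n, n)
    upper = min(n, total_correct)
    tail = sum(
        comb(total_correct, k) * comb(total_wrong, n - k)
        for k in range(correct_b, upper + 1)
    )
    return tail / denom


# Scores quoted in the comment above (hypothetical variable names).
n_questions = 20
gpt35_correct = 11
gpt4_correct = 17

diff = (gpt4_correct - gpt35_correct) / n_questions
p_value = fisher_one_sided(gpt35_correct, gpt4_correct, n_questions)

print(f"accuracy gap: {diff:.0%}")
print(f"one-sided p-value: {p_value:.3f}")
```

With only 20 questions the gap is suggestive rather than conclusive on its own, which is part of why another comparison is worth finding.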

predicted NO

More evidence:

Caplan's midterm is probably too recent for the training data.

I don't actually think this question is a difficult one; it just seemed easier to make it than to argue the point on Twitter.

bought Ṁ25 of NO

Some weak-ish evidence: I compared GPT-4's performance on the easiest level of the Japanese Language Proficiency Test to GPT-3.5's. I doubt there's any way to prove conclusively whether the practice exam I used was in the training data of either model, but it might have been hard for OpenAI's robots to harvest it, since as far as I know it is only available as a non-OCR'd PDF.

My impression is that GPT-4 is just a lot more comfortable with the multiple-choice question format, especially fill-in-the-blank style questions.