Is GPT-4's human exam performance mainly due to memorisation?
Resolved NO (Jul 21)

Resolves positive if convincing analysis comes out that GPT-4 is not much better than GPT-3.5 on human exams that are known to not appear in the training data, and negative if convincing analysis comes out showing the opposite.

predicted NO

Finally, a resolution source https://arxiv.org/abs/2307.10635

“To ensure an unbiased evaluation, we carefully curate questions that are not readily accessible online and couldn’t be easily extracted and transformed into text”

“GPT4 surpassed GPT3.5 by a significant margin across all seven experimental settings”

predicted NO

This evaluation shows an 11-percentage-point advantage for GPT-4, but does not put contamination beyond doubt: https://arxiv.org/pdf/2303.17003.pdf

This evaluation is unlikely to be contaminated and shows an advantage for GPT-4, even if the result is not strong overall (0 points versus 4 points). Whether that counts as “much” better is debatable: https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/

Holding off resolution for now

predicted NO

I tested GPT-3.5 on the multiple-choice questions here, and it got 11/20 versus 17/20 for GPT-4: https://scottaaronson.blog/?p=7209

I’ll try to find 1 more comparison before I resolve
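
For anyone who wants to sanity-check how much weight a 20-question comparison like the 11/20 versus 17/20 one can bear, here is a rough sketch. It assumes the two scores can be treated as independent samples and runs a one-sided Fisher exact test using only the standard library. That independence assumption is mine and is not quite right, since both models answered the same questions; a paired test would need per-question results, which are not given in this thread. The function names are illustrative only.

```python
from math import comb


def point_gap(a_correct: int, b_correct: int, n: int) -> float:
    """Score gap of model B over model A, in percentage points."""
    return 100 * (b_correct - a_correct) / n


def fisher_one_sided(a_correct: int, b_correct: int, n: int) -> float:
    """One-sided Fisher exact p-value for two models on n questions each.

    Probability that model B gets at least b_correct right, given the
    combined number of correct answers, if both models were equally good.
    Assumption: the two runs are treated as independent samples.
    """
    total_correct = a_correct + b_correct
    denom = comb(2 * n, total_correct)
    upper = min(n, total_correct)
    tail = sum(
        comb(n, k) * comb(n, total_correct - k)
        for k in range(b_correct, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Scores reported in the comment above: GPT-3.5 11/20, GPT-4 17/20.
    print(f"gap: {point_gap(11, 17, 20):.0f} percentage points")
    print(f"one-sided p-value: {fisher_one_sided(11, 17, 20):.3f}")
```

With so few questions, a gap of this size is suggestive rather than conclusive on its own, which is one reason to look for further comparisons before resolving.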

predicted NO

More evidence: https://twitter.com/MatthewJBar/status/1636082863362961408

Caplan's midterm is probably too recent for the training data.

I don't actually think this question is a difficult one; it just seemed easier to make a market than to argue the point on Twitter.

bought Ṁ25 of NO

Some weak-ish evidence: I compared GPT-4's performance on the easiest level of the Japanese Language Proficiency Test to GPT-3.5's. I doubt there's any way to prove conclusively whether the practice exam I used was in the training data of either model, but it may have been hard for OpenAI's crawlers to harvest, since as far as I know it is only available as a non-OCR'd PDF.

My impression is that 4 is just a lot more comfortable with the multiple-choice question format, especially fill-in-the-blank style questions.