
Resolves positive if convincing analysis comes out showing that GPT-4 is not much better than GPT-3.5 on human exams known not to appear in the training data, and negative if convincing analysis shows the opposite.
Finally, a resolution source https://arxiv.org/abs/2307.10635
“To ensure an unbiased evaluation, we carefully curate questions that are not readily accessible online and couldn’t be easily extracted and transformed into text”
“GPT4 surpassed GPT3.5 by a significant margin across all seven experimental settings”
This evaluation shows an advantage (11 percentage points) for GPT-4, but does not rule out contamination: https://arxiv.org/pdf/2303.17003.pdf
This evaluation is unlikely to be contaminated and shows an advantage for GPT-4, even if it's not a strong result overall (0 points vs 4 points). Whether it's “much” better is debatable: https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/
Holding off resolution for now
I tested GPT-3.5 on the multiple-choice questions here, and it got 11/20 vs 17/20 for GPT-4 (rough percentage comparison sketched below): https://scottaaronson.blog/?p=7209
I’ll try to find one more comparison before I resolve.
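For scale, here is a quick back-of-the-envelope conversion of those raw scores into percentage points, so the gap can be compared with the 11-point figure from the paper above. This is a sketch using only the numbers quoted in the comment, not a formal evaluation:

```python
# Illustrative arithmetic only: raw scores quoted in the comment above (11/20 vs 17/20).
gpt35_correct, gpt4_correct, total = 11, 17, 20

gpt35_pct = gpt35_correct / total * 100  # GPT-3.5 score as a percentage
gpt4_pct = gpt4_correct / total * 100    # GPT-4 score as a percentage
gap_pp = gpt4_pct - gpt35_pct            # gap in percentage points

print(f"GPT-3.5: {gpt35_pct:.0f}%  GPT-4: {gpt4_pct:.0f}%  gap: {gap_pp:.0f} pp")
# GPT-3.5: 55%  GPT-4: 85%  gap: 30 pp
```

On this 20-question sample the gap (30 percentage points) is larger than the 11-point advantage reported above, though with so few questions the uncertainty is large.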
More evidence: https://twitter.com/MatthewJBar/status/1636082863362961408
Caplan's midterm is probably too recent to be in the training data.
I don't actually think this question is a difficult one; it just seemed easier to create this market than to argue the point on Twitter.
Some weak-ish evidence: I compared GPT-4's performance on the easiest level of the Japanese Language Proficiency Test to GPT-3.5's. I doubt there's any way to prove conclusively whether the practice exam I used was in the training data of either model, but it may have been hard for OpenAI's crawlers to harvest, since as far as I know it was only available as a non-OCR'd PDF.
My impression is that GPT-4 is just a lot more comfortable with the multiple-choice question format, especially fill-in-the-blank-style questions.