YES = convinced not sketchy, otherwise resolves NO in a week
https://arxiv.org/abs/2306.08997
afaik GPT-4 is not smart enough to solve every MIT undergrad math and CS problem, not even close, so I was initially skeptical
then saw the sketchiness described in this tweet: https://twitter.com/yoavgo/status/1669760558436872193
~~seeing as we all hold NO, anyone mind an early resolution?~~ actually, the description says i'll wait a week, so i'll just wait it out
@firstuserhere "That's not all. In our analysis of the few-shot prompts, we found significant leakage and duplication in the uploaded dataset, such that full answers were being provided directly to GPT 4 within the prompt for it to parrot out as its own."
I read through it - they really do claim that GPT-4 achieves a perfect score on all their undergrad math and CS problems (with prompt engineering).
Their prompt engineering section is quite fishy - they seem to be using GPT to generate answers, grading those answers with GPT, then iterating after having GPT modify the prompt to be better rated by GPT.
They are sparse on specific details. I'd say it's unlikely they're lying that GPT-4 gets a perfect score per their grading system, but it's likely their method of determining that score is different from what you or I would typically assume.
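The circular loop described above can be sketched roughly as follows. This is a hypothetical illustration of the criticism, not the authors' actual code; the function names, scoring scale, and stub model are all invented for clarity.

```python
# Hypothetical sketch of the "GPT grades GPT, then GPT rewrites the
# prompt to score higher" loop being criticized. All names and the
# 0-5 grading scale are illustrative, not from the paper.

def solve_grade_iterate(problem, gpt, max_rounds=3, target=5):
    """Ask the model to answer, have the *same* model grade the answer,
    and keep rewriting the prompt until the self-assigned grade hits
    the target. Note the grader is never independent of the solver."""
    prompt = problem
    answer, grade = None, 0
    for _ in range(max_rounds):
        answer = gpt(f"Solve: {prompt}")
        grade = int(gpt(f"Grade this answer from 0 to 5: {answer}"))
        if grade >= target:
            break
        # The model rewrites its own prompt to please its own grader,
        # so the loop optimizes for the grader's approval, not correctness.
        prompt = gpt(f"Rewrite this prompt so the answer scores higher: {prompt}")
    return answer, grade
```

Because the solver, grader, and prompt-rewriter are the same model, a "perfect score" out of this loop measures self-agreement, which is the fishiness being pointed out.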