
Current SotA is 54%. Does GPT-4 with Wolfram score >= 64%?
If the May 2023 version of GPT-4 with Wolfram becomes unavailable before anyone conducts this test, this question resolves N/A.
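For concreteness, the resolution rule above can be sketched as a small function. This is only an illustration of how the criteria read, not an official resolver. The function name `resolve`, its parameters, and the "OPEN" state for a still-pending question are assumptions made for this sketch.

```python
from typing import Optional

SOTA_PERCENT = 54.0        # state of the art quoted in the question
THRESHOLD_PERCENT = 64.0   # score GPT-4 with Wolfram must reach for YES


def resolve(score_percent: Optional[float], model_available: bool) -> str:
    """Hypothetical helper: map a test outcome to a resolution.

    score_percent   -- measured accuracy in percent, or None if nobody
                       has run the test yet
    model_available -- whether the May 2023 GPT-4 with Wolfram can still
                       be queried
    """
    if score_percent is None:
        # No test was conducted: N/A once the model is gone,
        # otherwise the question simply stays open (assumed label).
        return "OPEN" if model_available else "N/A"
    # A test was conducted, so availability no longer matters.
    return "YES" if score_percent >= THRESHOLD_PERCENT else "NO"
```

Under this reading, a result obtained before the model is retired still counts even if the model later becomes unavailable.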
From a skim, https://arxiv.org/pdf/2308.07921v1.pdf finds a roughly 13% improvement using Code Interpreter. That's with a significantly updated model compared to GPT-4 at the time of question writing.
I've sold my stake to avoid a conflict of interest, in anticipation of having to resolve this question N/A. OAI has not specified when, but the docs say gpt-4-0314 may be removed at any time.
@JacobPfau FWIW, my credence that a strategy similar to the above-linked paper's gets a >10% performance boost out of gpt-4-0314 using Wolfram is around 50%.
Relevant previous work:
https://arxiv.org/pdf/2305.12524.pdf
https://arxiv.org/pdf/2211.12588.pdf
https://arxiv.org/pdf/2211.10435.pdf
AFAICT from skimming, none used Wolfram for intermediate steps; they mostly used Python. Also, none evaluate on MATH.