Continuation of:
https://manifold.markets/PatrickDelaney/will-ai-achieve-significantly-highe
I reserve the right to change the metrics from the market above if they have grown stale. I aim to have this finalized by the end of January 2024.
The currently available flagship models (PaLM 2, GPT-4, and Gemini Pro) have not yet been evaluated on BIG-Bench. As far as I can tell, the largest model evaluated is the original PaLM, not PaLM 2, and it is GPT-3, not GPT-4V, that is being evaluated. You can verify this in the published BIG-Bench paper.
This is because the GPT-4 technical report states that GPT-4 was not evaluated on BIG-Bench because "portions of BIG-Bench were inadvertently mixed into the training set..." (p. 6).
Given that this question is trying to gauge whether this year's advances in AI are significantly higher with respect to "general conceptual skills", I would argue we need a new metric that actually covers the current state-of-the-art models.
I don't think you can fairly resolve this market by carrying over the old metric of achieving a 60 on BIG-Bench Lite to another test. I propose resolving this N/A and remaking it around the Massive Multitask Language Understanding (MMLU) benchmark.
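For anyone unfamiliar with it, MMLU is a 57-subject multiple-choice benchmark. Here's a minimal sketch of pulling a few questions to see what the proposed replacement metric actually covers; it assumes the Hugging Face `datasets` library and the community-hosted `cais/mmlu` dataset, neither of which is tied to this market's eventual resolution source.

```python
# Sketch: browse a few MMLU questions to see what the benchmark tests.
# Assumes the `cais/mmlu` dataset on the Hugging Face Hub, where the "all"
# config combines the 57 subjects and each row has a question, four answer
# choices, and the index of the correct choice.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")

print(len(mmlu), "questions across", len(set(mmlu["subject"])), "subjects")

# Print the first few questions, marking the correct choice with an asterisk.
for row in mmlu.select(range(3)):
    print(row["subject"], "|", row["question"])
    for i, choice in enumerate(row["choices"]):
        marker = "*" if i == row["answer"] else " "
        print(f"  {marker} {choice}")
```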