LLaMA-I 65B achieves 68.9% 5-shot accuracy on MMLU, while base LLaMA 65B achieves 63.4%. Chain-of-thought probably adds roughly another 3 percentage points (I will update the question with precise numbers if someone runs the experiment); cf. Flan-PaLM-CoT.
Will the best soft prompt for base LLaMA 65B on MMLU achieve greater than 72% accuracy (on the test subset), allowing arbitrary chains of thought, before 2025? The soft prompt may be followed by arbitrary text (a hard prompt), including up to 5 example shots and any other text (except further MMLU questions). The soft prompt may be tuned on the MMLU validation set, but not on the test subset.
To reiterate, the two conditions to be compared have the following structure:
(Base model) tuned soft prompt, hard prompt, five example shots, chain of thought, answer.
(Instruction-fine-tuned model) five example shots, chain of thought, answer.
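For concreteness, here is a minimal NumPy sketch of what "soft prompt" means in the base-model condition: a matrix of free embedding vectors prepended to the ordinary token embeddings of the hard prompt, with only those vectors tuned while the model stays frozen. All sizes below are illustrative, not LLaMA's actual dimensions.

```python
import numpy as np

# Illustrative sizes only (LLaMA 65B actually uses d_model = 8192).
vocab_size, d_model, n_soft = 100, 16, 20

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))

# The soft prompt: n_soft free vectors, tuned (e.g. by gradient descent on
# the MMLU validation set) while the base model's weights stay frozen.
soft_prompt = rng.normal(size=(n_soft, d_model))

# The hard prompt (few-shot examples, instructions) arrives as token ids.
hard_prompt_ids = np.array([5, 17, 42, 7])

# The model's input sequence: soft prompt prepended to embedded hard prompt.
inputs = np.concatenate([soft_prompt, token_embedding[hard_prompt_ids]], axis=0)
print(inputs.shape)  # (n_soft + len(hard_prompt_ids), d_model)
```

The key point for the question is that the soft prompt's vectors live in embedding space and need not correspond to any real tokens, which is why it can encode information no hard prompt of the same length could.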