LLaMA-I 65B achieves 68.9% 5-shot accuracy on MMLU, while base LLaMA 65B achieves 63.4%. Chain-of-thought probably adds roughly another 3 percentage points (I will update the question with precise numbers if someone runs the experiment); cf. Flan-PaLM-CoT.
Will the best soft prompt for base LLaMA 65B on MMLU achieve greater than 72% accuracy (on the test subset), allowing arbitrary chains of thought, before 2025? The soft prompt may be followed by arbitrary text (a hard prompt), including up to 5 example shots and any other text (except further MMLU questions). The soft prompt may be tuned on the MMLU validation set, but not on the test subset.
To reiterate, the two conditions to be compared have the following structure:
(Base model) tuned soft prompt, hard prompt, five example shots, chain of thought, answer.
(Instruction-fine-tuned model) five example shots, chain of thought, answer.
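For concreteness, here is a minimal NumPy sketch of what "soft prompt" means in the base-model condition: a matrix of free embedding vectors prepended to the ordinary token embeddings of the hard prompt, with only those vectors tuned while the model stays frozen. All sizes below are illustrative, not LLaMA's actual dimensions.

```python
import numpy as np

# Illustrative sizes only (LLaMA 65B actually uses d_model = 8192).
vocab_size, d_model, n_soft = 100, 16, 20

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))

# The soft prompt: n_soft free vectors, tuned (e.g. by gradient descent on
# the MMLU validation set) while the base model's weights stay frozen.
soft_prompt = rng.normal(size=(n_soft, d_model))

# The hard prompt (few-shot examples, instructions) arrives as token ids.
hard_prompt_ids = np.array([5, 17, 42, 7])

# The model's input sequence: soft prompt prepended to embedded hard prompt.
inputs = np.concatenate([soft_prompt, token_embedding[hard_prompt_ids]], axis=0)
print(inputs.shape)  # (n_soft + len(hard_prompt_ids), d_model)
```

The key point for the question is that the soft prompt's vectors live in embedding space and need not correspond to any real tokens, which is why it can encode information no hard prompt of the same length could.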