Can prompting improve LMs more than fine-tuning?

LLAMA is the current best publicly trained LM. MMLU is a large, college-level, knowledge-focused QA dataset which remains challenging for LMs.

This question resolves YES if: The best (hard) prompt for LLAMA 65B on MMLU achieve greater than 72%, allowing arbitrary chains-of-thought, before 2025? 72% is chosen to be significantly higher than LLAMA's performance post fine-tuning for instruction following.

Further details: The prompt may include up to 5 example shots, and any other text (except further MMLU questions). For the purposes of this question, only hard prompts will be counted i.e. inputs which correspond to tokenizations of actual words/emojis/etc. That is in contrast to soft prompts, not allowed, which are LM inputs which can be arbitrary input vectors which need not correspond to actual words.

LLAMA-I 65B achieves 68.9 5-shot accuracy on MMLU, LLAMA-base 65B achieves 63.4 5-shot accuracy. Chain-of-thought probably adds another 3% ish (will update question with precise numbers if someone does the experiment) c.f. FLAN-PALM-CoT

Apr 16, 12:12pm: ~~Can prompting improve LLAMA-LM more than instruction fine-tuning, and chain-of-thought?~~ → Can prompting improve LLAMA-LM more than instruction fine-tuning, and chain-of-thought?

Update 2025-02-01 (PST): - Resolution Update: If no improved prompt is provided within 24 hours, the market will automatically resolve to No. (AI summary of creator comment)

#	Name	Total profit
1		Ṁ13
2		Ṁ9
3		Ṁ3
4		Ṁ1
5		Ṁ0

🏅 Top traders

Related questions