Will Meta officially release any model that performs better than the equivalently sized (±10% in parameter count) LLaMa 3.1 model, where "performs better" means "at least 0.5 percentage points more accurate on MMLU"? Base models only.
For example, LLaMa 3.1 70B's MMLU score is 83.6% (an improvement over LLaMa 3.0 70B's 79.5%). A hypothetical LLaMa 3.2 70B would need to score at least 84.1% on MMLU for this market to resolve YES. Note that any model in the family (8B, 70B, 405B) beating its same-size baseline by at least 0.5 percentage points is enough to resolve this market.
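To make the resolution arithmetic concrete, here is a minimal Python sketch. The function name and candidate scores are hypothetical; the baselines are the 3.1 figures quoted in this description (8B is omitted because its score isn't quoted here):

```python
# Baseline MMLU scores for the LLaMa 3.1 base models, as quoted in this
# market description.
LLAMA_3_1_BASE_MMLU = {"70B": 83.6, "405B": 85.2}
THRESHOLD = 0.5  # percentage points above the same-size 3.1 baseline

def resolves_yes(candidate_scores: dict[str, float]) -> bool:
    """True if any candidate base model (already matched to a 3.1 size
    within the +/-10% parameter-count window) beats its baseline by at
    least THRESHOLD percentage points on MMLU."""
    return any(
        candidate_scores.get(size, float("-inf")) >= baseline + THRESHOLD
        for size, baseline in LLAMA_3_1_BASE_MMLU.items()
    )

print(resolves_yes({"70B": 84.1}))  # True: 84.1 >= 83.6 + 0.5
print(resolves_yes({"70B": 84.0}))  # False: falls 0.1 points short
```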
Multimodal models are eligible, but only text MMLU performance will be evaluated. Models that were fine-tuned, DPO'd, RLHF'd, or CPT'd on synthetic data will not resolve this market.
For reference, LLaMa 3.0 70B's MMLU score was 79.5, GPT-4o's score is 88.7, and LLaMa 3.1 405B base's score is 85.2. (LLaMa 3.1 405B Instruct's score is 88.7.)
@Fay42 I count a 90B model with 20B of multimodal adapters and a 70B language model as a 70B language model, since it's possible to isolate and run only the language part. It's unclear to me whether the language part has been updated, though.
@Fay42 Yup, comparing the MMLU CoT scores for 3.2 on https://www.llama.com/ with those on the 3.1 model card, they do look to be exactly identical.
Please add liquidity to this market! This is an important question that I care about. I've already added M3000 myself.