Entropy-based sampling (colloquially, "the Shrek-Frog Sampler") is a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models](https://www.anthropic.com/news/claude-3-5-sonnet) to get much better results using inference time compute."
https://github.com/xjdr-alt/entropix
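The core idea is to branch on the entropy of the next-token distribution: decode greedily when the model is confident and sample more exploratively when it is not. A minimal sketch of that idea follows; it is an illustration, not the entropix implementation, and the thresholds and temperatures are made-up placeholders.

```python
import numpy as np

def entropy_varentropy(logits):
    """Entropy and varentropy (variance of surprisal) of a next-token distribution."""
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    surprisal = -np.log(p + 1e-12)
    ent = float((p * surprisal).sum())
    varent = float((p * (surprisal - ent) ** 2).sum())
    return ent, varent

def sample_next(logits, rng, low=0.5, high=2.5):
    """Choose a decoding strategy from the entropy of the distribution:
    argmax when the model is confident, temperature sampling otherwise
    (thresholds `low`/`high` are illustrative, not entropix's values)."""
    ent, _ = entropy_varentropy(logits)
    if ent < low:                           # confident: take the argmax
        return int(np.argmax(logits))
    temp = 1.0 if ent < high else 1.5       # very uncertain: sample hotter
    z = logits / temp
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))
```

The real entropix sampler also uses varentropy to distinguish "uniformly uncertain" from "torn between a few options" and can inject a pause/"thinking" token, but the entropy-dispatch skeleton above is the gist.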
Resolution Criteria
This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.2 model (base or Instruct) leads to reasoning scores that are better than the baseline of the corresponding Llama 3.2 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that the entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:
MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model w/ regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.
As an example, it would suffice to post a comment on this market linking to logs of Llama 3.2 3B with a single set of sampling parameters producing BOTH:
>= 65% macro-averaged 5-shot accuracy on MMLU (Meta reports 63.4% as their score for this model size + eval)
>= 50% macro-averaged zero-shot CoT accuracy on MATH (Meta reports 48.0% as their score for this model size + eval)
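The resolution condition can be checked mechanically. The helper below is a hypothetical illustration of one reading of the criteria (every shared benchmark must clear the 2-point margin, and there must be at least two of them); the benchmark names and scores in the test are placeholders, not real results.

```python
def resolves_yes(entropy_scores, baseline_scores, margin=2.0):
    """YES iff, on every benchmark reported for both runs (at least two),
    the entropy-sampled model's accuracy (in %) exceeds the baseline by
    at least `margin` percentage points."""
    shared = entropy_scores.keys() & baseline_scores.keys()
    if len(shared) < 2:
        return False
    return all(entropy_scores[b] - baseline_scores[b] >= margin for b in shared)
```

Whether "across multiple reasoning benchmarks" means *all* reported benchmarks or just *some* is one of the ambiguities the final paragraph reserves judgment on.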
Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
I will not trade on this market.
Shameless Yoink of 3.1 Market