Entropy-based sampling (colloquially, "the Shrek-Frog Sampler") is a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models](https://www.anthropic.com/news/claude-3-5-sonnet) to get much better results using inference time compute."
https://github.com/xjdr-alt/entropix
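The core idea is to branch on the entropy of the next-token distribution: decode greedily when the model is confident and sample more exploratively when it is not. A minimal sketch of that idea follows; it is an illustration, not the entropix implementation, and the thresholds and temperatures are made-up placeholders.

```python
import numpy as np

def entropy_varentropy(logits):
    """Entropy and varentropy (variance of surprisal) of a next-token distribution."""
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    surprisal = -np.log(p + 1e-12)
    ent = float((p * surprisal).sum())
    varent = float((p * (surprisal - ent) ** 2).sum())
    return ent, varent

def sample_next(logits, rng, low=0.5, high=2.5):
    """Choose a decoding strategy from the entropy of the distribution:
    argmax when the model is confident, temperature sampling otherwise
    (thresholds `low`/`high` are illustrative, not entropix's values)."""
    ent, _ = entropy_varentropy(logits)
    if ent < low:                           # confident: take the argmax
        return int(np.argmax(logits))
    temp = 1.0 if ent < high else 1.5       # very uncertain: sample hotter
    z = logits / temp
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))
```

The real entropix sampler also uses varentropy to distinguish "uniformly uncertain" from "torn between a few options" and can inject a pause/"thinking" token, but the entropy-dispatch skeleton above is the gist.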
Resolution Criteria
This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.2 model (base or Instruct) leads to reasoning scores that are better than the baseline of the corresponding Llama 3.2 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that the entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:
MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model w/ regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.
As an example, it would suffice to post a comment on this market linking to logs of Llama 3.2 3B with a single set of sampling parameters producing BOTH:
>= 65% macro-averaged 5-shot accuracy on MMLU (Meta reports 63.4% as their score for this model size + eval)
>= 50% macro-averaged zero-shot CoT accuracy on MATH (Meta reports 48.0% as their score for this model size + eval)
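The resolution condition can be checked mechanically. The helper below is a hypothetical illustration of one reading of the criteria (every shared benchmark must clear the 2-point margin, and there must be at least two of them); the benchmark names and scores in the test are placeholders, not real results.

```python
def resolves_yes(entropy_scores, baseline_scores, margin=2.0):
    """YES iff, on every benchmark reported for both runs (at least two),
    the entropy-sampled model's accuracy (in %) exceeds the baseline by
    at least `margin` percentage points."""
    shared = entropy_scores.keys() & baseline_scores.keys()
    if len(shared) < 2:
        return False
    return all(entropy_scores[b] - baseline_scores[b] >= margin for b in shared)
```

Whether "across multiple reasoning benchmarks" means *all* reported benchmarks or just *some* is one of the ambiguities the final paragraph reserves judgment on.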
Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
I will not trade on this market.
Shameless Yoink of 3.1 Market