
Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."
https://github.com/xjdr-alt/entropix
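For intuition only (this is not the entropix implementation, and the thresholds below are invented for illustration): the core idea is to measure the entropy of the model's next-token distribution and change the sampling behavior based on it, e.g. decode greedily when the model is confident and explore more when it is uncertain. A minimal sketch:

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over raw logits.
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy (in nats) of a probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample(probs):
    # Draw an index proportionally to probs.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def entropy_based_select(logits, low=0.5, high=2.5):
    """Toy entropy-gated decoding step.

    The `low`/`high` thresholds and the three-way policy are
    illustrative assumptions, not the actual entropix logic:
      - low entropy  -> model is confident: take the argmax
      - high entropy -> model is uncertain: resample at a hotter temperature
      - otherwise    -> ordinary temperature-1 sampling
    """
    probs = softmax(logits)
    h = entropy(probs)
    if h < low:
        return max(range(len(logits)), key=lambda i: logits[i]), "greedy"
    if h > high:
        return sample(softmax(logits, temperature=1.5)), "explore"
    return sample(probs), "sample"
```

A real implementation operates on the model's logits at each decoding step (and entropix also considers quantities such as the variance of attention entropies), but the gating structure above is the basic shape of the idea.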
This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.1 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama 3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:
MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
To establish the baseline to compare against: if Meta has not already published official scores for that model with regular CoT on a given benchmark, it is acceptable to produce unofficial scores and link them alongside the comparison results.
As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B, with a single set of sampling parameters, producing BOTH:
>= 75% macro-averaged zero-shot accuracy on MMLU
(Meta reports 73.0% as their score for this model size + eval)
>= 53% macro-averaged zero-shot accuracy on MATH
(Meta reports 51.0% as their score for this model size + eval)
Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
I will not trade on this market.