Will entropy-based sampling improve Llama3.1 on reasoning benchmarks in 2025?

Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."

https://github.com/xjdr-alt/entropix
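The core idea can be illustrated with a minimal sketch: measure the entropy of the model's next-token distribution and adapt the decoding strategy to it. This is not entropix's actual implementation, and the threshold values and strategy names here are hypothetical, chosen only to show the shape of the technique.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_strategy(probs, low=0.5, high=2.5):
    # Hypothetical thresholds `low` and `high` -- not entropix's real values.
    h = entropy(probs)
    if h < low:
        return "argmax"   # model is confident: decode greedily
    elif h < high:
        return "sample"   # moderate uncertainty: temperature sampling
    else:
        return "branch"   # high uncertainty: branch or inject extra "thinking" tokens

# A peaked distribution decodes greedily; a near-uniform one triggers branching.
print(choose_strategy([0.99, 0.01]))   # low entropy
print(choose_strategy([0.05] * 20))    # high entropy
```

In entropix-style samplers this decision is made per token, which is how inference-time compute gets concentrated on the uncertain steps of a reasoning chain.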

This market will resolve YES if, by the end of 2025, there has been a credible demonstration that applying entropy-based sampling to a Llama3.1 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:

To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model w/ regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.

As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B with a single set of sampling parameters producing BOTH:

  • >= 75% macro-averaged zero-shot accuracy on MMLU

    • (Meta reports 73.0% as their score for this model size + eval)

  • >= 53% macro-averaged zero-shot accuracy on MATH

    • (Meta reports 51.0% as their score for this model size + eval)
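The resolution criterion above can be expressed as a small check: the candidate run must beat the baseline by at least 2 percentage points on every benchmark in the comparison. The function name and the candidate scores below are hypothetical; the baseline numbers are the Meta-reported figures quoted above.

```python
def meets_bar(candidate, baseline, margin=2.0):
    # True if the candidate beats the baseline by >= `margin` percentage
    # points on every benchmark present in the baseline.
    return all(candidate[b] >= baseline[b] + margin for b in baseline)

baseline = {"MMLU": 73.0, "MATH": 51.0}   # Meta's reported Llama 3.1 8B Instruct scores
candidate = {"MMLU": 75.0, "MATH": 53.0}  # hypothetical entropy-sampling results

print(meets_bar(candidate, baseline))  # the worked example above just clears the bar
```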

This market will resolve NO at the end of 2025 otherwise. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.

I will not trade on this market.


