
Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."
https://github.com/xjdr-alt/entropix
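For intuition only (this is not the entropix implementation, and the thresholds below are invented for illustration): the core idea is to measure the entropy of the model's next-token distribution and change the sampling behavior based on it, e.g. decode greedily when the model is confident and explore more when it is uncertain. A minimal sketch:

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over raw logits.
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy (in nats) of a probability distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample(probs):
    # Draw an index proportionally to probs.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def entropy_based_select(logits, low=0.5, high=2.5):
    """Toy entropy-gated decoding step.

    The `low`/`high` thresholds and the three-way policy are
    illustrative assumptions, not the actual entropix logic:
      - low entropy  -> model is confident: take the argmax
      - high entropy -> model is uncertain: resample at a hotter temperature
      - otherwise    -> ordinary temperature-1 sampling
    """
    probs = softmax(logits)
    h = entropy(probs)
    if h < low:
        return max(range(len(logits)), key=lambda i: logits[i]), "greedy"
    if h > high:
        return sample(softmax(logits, temperature=1.5)), "explore"
    return sample(probs), "sample"
```

A real implementation operates on the model's logits at each decoding step (and entropix also considers quantities such as the variance of attention entropies), but the gating structure above is the basic shape of the idea.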
This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.1 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama 3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:
MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
To establish the baseline to compare against: if Meta has not already published official scores for that model with regular CoT on a given benchmark, it is acceptable to produce unofficial scores and link them alongside the comparison results.
As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B, with a single set of sampling parameters, producing BOTH:
>= 75% macro-averaged zero-shot accuracy on MMLU
(Meta reports 73.0% as their score for this model size + eval)
>= 53% macro-averaged zero-shot accuracy on MATH
(Meta reports 51.0% as their score for this model size + eval)
Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
I will not trade on this market.