Will entropy-based sampling improve Llama3.1 on reasoning benchmarks in 2024?
53% chance · Ṁ9256 · Jan 1

Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."

https://github.com/xjdr-alt/entropix
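There is no paper yet; the idea circulates through the repo above and xjdr's tweets. Roughly, the sampler inspects the entropy (and "varentropy", the variance of token surprisal) of the model's next-token distribution and adapts its strategy: act greedily when the model is confident, and raise temperature, branch, or inject a clarifying/"thinking" token when it is not. The sketch below is only a rough illustration of that idea under assumed thresholds and a hypothetical pause_token_id; it is not the entropix implementation.

```python
import numpy as np

def entropy_and_varentropy(logits):
    """Entropy and varentropy (variance of surprisal) of the next-token distribution."""
    logits = np.asarray(logits, dtype=np.float64)
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    surprisal = -np.log(probs + 1e-12)
    entropy = float((probs * surprisal).sum())
    varentropy = float((probs * (surprisal - entropy) ** 2).sum())
    return entropy, varentropy

def sample_next_token(logits, rng, pause_token_id=None,
                      low_ent=0.5, high_ent=3.0, high_varent=3.0, base_temp=0.7):
    """Choose the next token based on how (un)certain the model looks.

    Low entropy + low varentropy  -> model is confident: take the argmax.
    High entropy + low varentropy -> model is uniformly unsure: emit a
        clarifying/"pause" token (if one is configured) to spend more
        inference-time compute before committing.
    Anything else                 -> sample, with temperature scaled up by entropy.
    """
    ent, varent = entropy_and_varentropy(logits)
    logits = np.asarray(logits, dtype=np.float64)

    if ent < low_ent and varent < high_varent:
        return int(np.argmax(logits))
    if ent > high_ent and varent < high_varent and pause_token_id is not None:
        return int(pause_token_id)

    temp = base_temp * (1.0 + 0.3 * ent)  # hotter when more uncertain
    scaled = (logits - logits.max()) / temp
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))
```

The actual repository is more elaborate (it draws on additional statistics and supports resampling/branching); the sketch only captures the core "adapt the sampling strategy to distributional uncertainty" idea.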

This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.1 model (base or Instruct) leads to reasoning scores that beat the baseline of the corresponding Llama 3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than the baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:

To establish the baseline, if Meta has not already published an official score for the corresponding Instruct model with regular CoT on a given benchmark, it is acceptable to produce unofficial baseline scores and link them alongside the comparison results.

As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B with a single set of sampling parameters producing BOTH of the following (see the margin-check sketch after the list):

  • >= 75% macro-averaged zero-shot accuracy on MMLU

    • (Meta reports 73.0% as their score for this model size + eval)

  • >= 53% macro-averaged zero-shot accuracy on MATH

    • (Meta reports 51.0% as their score for this model size + eval)
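To make the margin arithmetic concrete, here is a hypothetical helper (the function name and dictionaries are illustrative, not part of the resolution criteria) that tests whether a set of reported scores clears every baseline by the required 2 percentage points:

```python
MARGIN_PP = 2.0  # required improvement, in percentage points

def beats_baseline(candidate: dict, baseline: dict) -> bool:
    """True if every benchmark in `baseline` is beaten by at least MARGIN_PP."""
    return all(candidate[bench] - score >= MARGIN_PP for bench, score in baseline.items())

# Using the example numbers above (Meta's reported Llama 3.1 8B Instruct scores as the
# baseline, and the threshold scores from the bullets as the candidate):
baseline  = {"MMLU": 73.0, "MATH": 51.0}
candidate = {"MMLU": 75.0, "MATH": 53.0}
print(beats_baseline(candidate, baseline))  # True
```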

Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.

I will not trade on this market.


Wait, I think I just blundered, you said 3.1 not 3.2… SELL SELL SELL

What is the idea of this? Is there an associated paper?

@CampbellHutcheson XJDR tweets

@CampbellHutcheson Where we're going there's no papers only vibes.

No one is testing this on Llama 3.1 right now; it's being tested on Llama 3.2, which comes in smaller 1B and 3B varieties that 3.1 does not. This market is not worth betting on until it's fixed.

@deepfates If you don't expect anyone will evaluate this on any of the larger models too by the resolution date, then that would be a reason to bet NO on this market.

bought Ṁ500 NO

@CharlesFoster When I made this comment the market was at 30%; I would have bet YES if you had made the question clearer. Now that we have more information, I'm going to bet on software development taking longer than predicted, as is the base rate.

For anyone reading this, there's more context in the linked thread

https://x.com/_xjdr/status/1844451260117991424?t=hYC5sl7wn9TZ0JUpqghRpQ&s=19

@deepfates The models specified in the market name and resolution criteria have not changed since I first opened this. Apologies that there was a misunderstanding around that. If you need clarification, please let me know.

@deepfates Easily solved. If it's as promising as it looks, 3.1 70B/405B benchmarks are inevitable. https://manifold.markets/alexkropivny/will-entropybased-sampling-improve-6g57p3paxs
