Will entropy-based sampling improve Llama3.1 on reasoning benchmarks in 2024?
53% chance · Ṁ9256 · Jan 1

Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."

https://github.com/xjdr-alt/entropix
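There is no paper yet; the idea circulates through the repo above and xjdr's tweets. Roughly, the sampler inspects the entropy (and "varentropy", the variance of token surprisal) of the model's next-token distribution and adapts its strategy: act greedily when the model is confident, and raise temperature, branch, or inject a clarifying/"thinking" token when it is not. The sketch below is only a rough illustration of that idea under assumed thresholds and a hypothetical pause_token_id; it is not the entropix implementation.

```python
import numpy as np

def entropy_and_varentropy(logits):
    """Entropy and varentropy (variance of surprisal) of the next-token distribution."""
    logits = np.asarray(logits, dtype=np.float64)
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    surprisal = -np.log(probs + 1e-12)
    entropy = float((probs * surprisal).sum())
    varentropy = float((probs * (surprisal - entropy) ** 2).sum())
    return entropy, varentropy

def sample_next_token(logits, rng, pause_token_id=None,
                      low_ent=0.5, high_ent=3.0, high_varent=3.0, base_temp=0.7):
    """Choose the next token based on how (un)certain the model looks.

    Low entropy + low varentropy  -> model is confident: take the argmax.
    High entropy + low varentropy -> model is uniformly unsure: emit a
        clarifying/"pause" token (if one is configured) to spend more
        inference-time compute before committing.
    Anything else                 -> sample, with temperature scaled up by entropy.
    """
    ent, varent = entropy_and_varentropy(logits)
    logits = np.asarray(logits, dtype=np.float64)

    if ent < low_ent and varent < high_varent:
        return int(np.argmax(logits))
    if ent > high_ent and varent < high_varent and pause_token_id is not None:
        return int(pause_token_id)

    temp = base_temp * (1.0 + 0.3 * ent)  # hotter when more uncertain
    scaled = (logits - logits.max()) / temp
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))
```

The actual repository is more elaborate (it draws on additional statistics and supports resampling/branching); the sketch only captures the core "adapt the sampling strategy to distributional uncertainty" idea.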

This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.1 model (base or Instruct) leads to reasoning scores that beat the baseline of the corresponding Llama 3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than the baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:

To establish the baseline, if Meta has not already published an official score for the corresponding Instruct model with regular CoT on a given benchmark, it is acceptable to produce unofficial baseline scores and link them alongside the comparison results.

As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B with a single set of sampling parameters producing BOTH of the following (see the margin-check sketch after the list):

  • >= 75% macro-averaged zero-shot accuracy on MMLU

    • (Meta reports 73.0% as their score for this model size + eval)

  • >= 53% macro-averaged zero-shot accuracy on MATH

    • (Meta reports 51.0% as their score for this model size + eval)
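To make the margin arithmetic concrete, here is a hypothetical helper (the function name and dictionaries are illustrative, not part of the resolution criteria) that tests whether a set of reported scores clears every baseline by the required 2 percentage points:

```python
MARGIN_PP = 2.0  # required improvement, in percentage points

def beats_baseline(candidate: dict, baseline: dict) -> bool:
    """True if every benchmark in `baseline` is beaten by at least MARGIN_PP."""
    return all(candidate[bench] - score >= MARGIN_PP for bench, score in baseline.items())

# Using the example numbers above (Meta's reported Llama 3.1 8B Instruct scores as the
# baseline, and the threshold scores from the bullets as the candidate):
baseline  = {"MMLU": 73.0, "MATH": 51.0}
candidate = {"MMLU": 75.0, "MATH": 53.0}
print(beats_baseline(candidate, baseline))  # True
```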

Otherwise, this market will resolve NO at the end of 2024. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.

I will not trade on this market.


Wait, I think I just blundered, you said 3.1 not 3.2… SELL SELL SELL

What is the idea of this? Is there an associated paper?

@CampbellHutcheson XJDR tweets

@CampbellHutcheson Where we're going there's no papers only vibes.

No one is testing this on Llama 3.1 right now; it's being tested on Llama 3.2, which comes in smaller 1B and 3B varieties that 3.1 does not. This market is not worth betting on until it's fixed.

@deepfates If you don't expect anyone will evaluate this on any of the larger models too by the resolution date, then that would be a reason to bet NO on this market.

bought Ṁ500 NO

@CharlesFoster When I made this comment the market was at 30%; I would have bet YES if you had made the question clearer. Now that we have more information, I'm going to bet on software development taking longer than predicted, as is the base rate.

For anyone reading this, there's more context in the linked thread

https://x.com/_xjdr/status/1844451260117991424?t=hYC5sl7wn9TZ0JUpqghRpQ&s=19

@deepfates The models specified in the market name and resolution criteria have not changed since I first opened this. Apologies that there was a misunderstanding around that. If you need clarification, please let me know.

@deepfates Easily solved. If it's as promising as it looks, 3.1 70B/405B benchmarks are inevitable. https://manifold.markets/alexkropivny/will-entropybased-sampling-improve-6g57p3paxs
