Will entropy-based sampling improve Llama3.2 on reasoning benchmarks in 2024?

Entropy-based sampling (colloquially, "the Shrek-Frog Sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models](https://www.anthropic.com/news/claude-3-5-sonnet) to get much better results using inference time compute."

https://github.com/xjdr-alt/entropix
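Roughly, these methods inspect the entropy of the model's next-token distribution and adapt the sampling strategy (temperature, branching, or inserting "thinking" tokens) accordingly. A minimal sketch of the core idea is below; the threshold and temperature values are illustrative assumptions, not the entropix repository's actual settings:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_temperature(probs, low_t=0.1, high_t=1.0, threshold=1.5):
    """Entropy-gated temperature: sample near-greedily when the model is
    confident (low entropy) and explore when it is uncertain (high entropy).
    The threshold and temperatures are illustrative defaults, NOT the
    values entropix actually uses."""
    return high_t if token_entropy(probs) > threshold else low_t
```

The real implementation conditions on more signals (e.g. statistics of the attention weights as well as the logits), but the gating-on-uncertainty idea is the common core.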

Resolution Criteria

This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama3.2 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama3.2 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that the entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:

  • MMLU (https://github.com/hendrycks/test)

  • MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)

  • MATH (https://github.com/hendrycks/math/)

  • GPQA (https://arxiv.org/abs/2311.12022)

  • MathVista (https://mathvista.github.io/)

  • BBH (https://github.com/suzgunmirac/BIG-Bench-Hard)

  • AIW (https://github.com/LAION-AI/AIW)

  • ARC-AGI (https://github.com/fchollet/ARC-AGI)

To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model w/ regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.

As an example, it would suffice to post a comment on this market linking to logs of Llama 3.2 3B with a single set of sampling parameters producing BOTH:

  • >= 65.4% macro-averaged 5-shot accuracy on MMLU

    • (Meta reports 63.4% as their score for this model size + eval)

  • >= 50% macro-averaged zero-shot CoT accuracy on MATH

    • (Meta reports 48.0% as their score for this model size + eval)

This market will resolve NO at the end of 2024 otherwise. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
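For concreteness, the numeric part of the criterion can be sketched as a check over per-benchmark scores (in percent). The function name and the reading of "multiple" as "at least two" are this sketch's assumptions, not the market creator's ruling:

```python
def resolves_yes(baseline, candidate, margin=2.0, min_benchmarks=2):
    """Return True if the candidate (entropy-based sampling) run beats the
    baseline by at least `margin` percentage points on at least
    `min_benchmarks` shared benchmarks. Interpreting "multiple" as >= 2
    is an assumption for illustration."""
    beaten = [name for name, score in baseline.items()
              if name in candidate and candidate[name] - score >= margin]
    return len(beaten) >= min_benchmarks
```

For example, with the Llama 3.2 3B baselines quoted above, `resolves_yes({"MMLU": 63.4, "MATH": 48.0}, {"MMLU": 66.0, "MATH": 50.5})` passes, while a run that gains under 2 points per benchmark does not.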

I will not trade on this market.

Shameless Yoink of 3.1 Market

Will entropy-based sampling improve Llama3.1 on reasoning benchmarks in 2024?
Will entropy-based sampling improve Llama3.1 on reasoning benchmarks in 2024? (82% chance)

Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."

https://github.com/xjdr-alt/entropix

This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama3.1 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that the entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:

  • MMLU (https://github.com/hendrycks/test)

  • MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)

  • MATH (https://github.com/hendrycks/math/)

  • GPQA (https://arxiv.org/abs/2311.12022)

  • MathVista (https://mathvista.github.io/)

  • BBH (https://github.com/suzgunmirac/BIG-Bench-Hard)

  • AIW (https://github.com/LAION-AI/AIW)

  • ARC-AGI (https://github.com/fchollet/ARC-AGI)

To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model w/ regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.

As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B with a single set of sampling parameters producing BOTH:

  • >= 75% macro-averaged zero-shot accuracy on MMLU

    • (Meta reports 73.0% as their score for this model size + eval)

  • >= 53% macro-averaged zero-shot accuracy on MATH

    • (Meta reports 51.0% as their score for this model size + eval)

This market will resolve NO at the end of 2024 otherwise. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.

I will not trade on this market.