Entropy-based sampling (colloquially, "the shrek sampler") is a term for a new class of sampling methods for LLMs, intended to "simulate something similar to o1's CoT or [Anthropic's models] to get much better results using inference time compute."
https://github.com/xjdr-alt/entropix
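The core idea, very roughly, is to gate the sampling strategy on the entropy of the next-token distribution: sample normally when the model is confident, and branch or inject a "pause and think" step when it is uncertain. As a minimal sketch (not the entropix implementation; the thresholds and the branch signal here are illustrative assumptions):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    p = probs[probs > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_gated_sample(logits, low=0.5, high=2.5, temp=1.0, rng=None):
    """Toy entropy-gated sampler.

    - entropy < low:  model is confident -> take the argmax.
    - entropy > high: model is uncertain -> return None so the caller
      can branch or inject a chain-of-thought token (hypothetical policy).
    - otherwise:      ordinary temperature sampling.

    Returns (token_id_or_None, entropy). Thresholds are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    z = logits / temp
    probs = np.exp(z - np.max(z))  # softmax, numerically stable
    probs /= probs.sum()
    h = entropy(probs)
    if h < low:
        return int(np.argmax(probs)), h
    if h > high:
        return None, h  # caller decides how to "think harder"
    return int(rng.choice(len(probs), p=probs)), h
```

For example, a sharply peaked distribution falls through the confident branch, while a near-uniform distribution over a large vocabulary trips the high-entropy branch.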
This market will resolve YES if, by the end of 2024, there has been a credible demonstration that applying entropy-based sampling to a Llama 3.1 model (base or Instruct) leads to reasoning scores better than the baseline of the corresponding Llama 3.1 Instruct model with regular chain-of-thought (CoT). Specifically, someone must post a comment on this market linking to credible evidence that an entropy-based sampling method produces validation accuracies at least 2 percentage points higher than baseline across multiple reasoning benchmarks. Eligible "reasoning benchmarks" are:
MMLU-Pro (https://github.com/TIGER-AI-Lab/MMLU-Pro)
MathVista (https://mathvista.github.io/)
ARC-AGI (https://github.com/fchollet/ARC-AGI)
To establish the baseline to compare against, if Meta has not already published official scores on a given benchmark for that model with regular CoT, it is acceptable to produce unofficial scores and link them alongside the comparison results.
As an example, it would suffice to post a comment on this market linking to logs of Llama 3.1 8B with a single set of sampling parameters producing BOTH:
>= 75% macro-averaged zero-shot accuracy on MMLU
(Meta reports 73.0% as their score for this model size + eval)
>= 53% macro-averaged zero-shot accuracy on MATH
(Meta reports 51.0% as their score for this model size + eval)
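Both thresholds above use macro-averaged accuracy: the unweighted mean of per-category accuracies (subjects for MMLU, problem types for MATH), so small categories count as much as large ones. A one-line sketch of the computation (the category names are illustrative):

```python
def macro_average(per_category_accuracy):
    """Unweighted mean of per-category accuracies (macro average).

    Unlike a micro average, every category contributes equally
    regardless of how many questions it contains.
    """
    return sum(per_category_accuracy.values()) / len(per_category_accuracy)

# Hypothetical per-subject accuracies: macro average is (0.8 + 0.6 + 0.7) / 3
score = macro_average({"algebra": 0.8, "geometry": 0.6, "logic": 0.7})
```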
This market will resolve NO at the end of 2024 otherwise. If there is substantial disagreement about the validity of the baseline or comparison scores, I will resolve the market based on my best understanding of the situation.
I will not trade on this market.
@deepfates If you don't expect anyone to evaluate this on any of the larger models by the resolution date, then that would be a reason to bet NO on this market.
@CharlesFoster when I made this comment, the market was at 30%; I would have bet YES if you had made the question clearer. Now that we have more information, I'm going to bet on software development taking longer than predicted, as is the base rate
For anyone reading this, there's more context in the linked thread
https://x.com/_xjdr/status/1844451260117991424?t=hYC5sl7wn9TZ0JUpqghRpQ&s=19
@deepfates The models specified in the market name and resolution criteria have not changed since I first opened this. Apologies that there was a misunderstanding around that. If you need clarification, please let me know.
@deepfates Easily solved. If it's as promising as it looks, 3.1 70B/405B benchmarks are inevitable. https://manifold.markets/alexkropivny/will-entropybased-sampling-improve-6g57p3paxs