Will there be an advance in LLMs comparable to chain of thought this year?
Resolved NO (Jan 3)

The advance does not have to be a prompting technique, but it must result in an actual improvement in the state of the art (SOTA) on some interesting set of tasks.


🏅 Top traders

#  Name  Total profit
1  —     Ṁ110
2  —     Ṁ13
3  —     Ṁ8
4  —     Ṁ2
5  —     Ṁ2

As far as I can tell, the answer is no. I will leave this market unresolved for a few days so people have time to submit evidence.

Comparable in terms of what? Is multi-benchmark performance from a base few-shot eval supposed to increase roughly as much as it does with CoT? Does it need to improve over CoT as well?

@JacobPfau It should produce a similar performance boost across a similar number of benchmarks. If the advance is as good as CoT when not using CoT, I will accept it (even if, when combined with CoT, the resulting performance increase is <= the sum of the separate increases).
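For readers unfamiliar with the baseline being compared against: chain-of-thought prompting improves performance by having the model write out intermediate reasoning before its final answer, rather than answering directly. Below is a minimal sketch contrasting the two prompt styles; `query_model` is a hypothetical placeholder for whatever LLM API you use, not part of any real library.

```python
# Minimal sketch contrasting direct few-shot prompting with chain-of-thought
# (CoT) few-shot prompting. `query_model` is a hypothetical placeholder for
# an LLM call; swap in your own API client.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

QUESTION = (
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?"
)

# Direct few-shot: exemplars map each question straight to its answer.
direct_prompt = (
    "Q: There are 15 trees. Workers plant trees until there are 21. "
    "How many trees did they plant?\n"
    "A: 6\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# Chain-of-thought few-shot: exemplars spell out intermediate reasoning,
# which the model then imitates before giving its final answer.
cot_prompt = (
    "Q: There are 15 trees. Workers plant trees until there are 21. "
    "How many trees did they plant?\n"
    "A: There were 15 trees originally and 21 after planting, so the "
    "workers planted 21 - 15 = 6 trees. The answer is 6.\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# The market asks whether any new technique produces a comparable jump,
# i.e. a gap like accuracy(cot_prompt) - accuracy(direct_prompt),
# measured across a similar number of benchmarks.
```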

Does Reflexion count? https://arxiv.org/abs/2303.11366

@jonsimon Currently, no. If this gets replicated a few times across more diverse datasets, it could. The HotPotQA results are about the right size of improvement to count, assuming they hold up under scrutiny (which I'm somewhat skeptical of; the paper doesn't look like it did much evaluation due diligence).
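As a rough paraphrase of the Reflexion idea (a simplified sketch, not the paper's exact implementation): the agent attempts a task, an evaluator scores the attempt, and on failure the model writes a verbal self-reflection that is fed back into the next attempt's context. In the sketch below, `query_model` and `evaluate` are hypothetical placeholders you would supply yourself.

```python
# Simplified sketch of the Reflexion loop (Shinn et al., arXiv:2303.11366):
# attempt -> evaluate -> verbal self-reflection -> retry with reflections
# in context. `query_model` and `evaluate` are hypothetical placeholders.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def evaluate(task: str, attempt: str) -> bool:
    raise NotImplementedError("plug in a task-specific success check here")

def reflexion(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []  # episodic memory of lessons from failures
    attempt = ""
    for _ in range(max_trials):
        memory = "\n".join(reflections)
        attempt = query_model(
            f"Task: {task}\n"
            f"Lessons from previous attempts:\n{memory}\n"
            "Your attempt:"
        )
        if evaluate(task, attempt):
            return attempt  # success: stop early
        # On failure, ask the model to diagnose what went wrong; the
        # resulting reflection conditions the next attempt.
        reflections.append(query_model(
            f"Task: {task}\nFailed attempt: {attempt}\n"
            "In one or two sentences, explain what went wrong and what "
            "to do differently next time:"
        ))
    return attempt  # best effort after max_trials
```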

