Follow-on from https://manifold.markets/Jasonb/will-a-gpt4-level-efficient-hrm-bas, since I'm interested in the possibility (or impossibility) of architectural innovations more broadly.
Resolution criteria:
The architecture must be meaningfully different from an auto-regressive transformer: either not transformer-based at all, or a significant fusion of a transformer with other components. To clarify, something like adding Mixture-of-Experts would not count, but diffusion-based LLMs would (though they also need to meet the other criteria).
The model must be significantly better than previous LLMs in some important respect, e.g. it achieves much higher performance for the same amount of training data, it matches frontier models with far fewer parameters, or it lacks some failure mode common to current or future transformer-based LLMs.
It must be generally on par with auto-regressive transformer-based LLMs at most tasks. If it excels in a few areas but is otherwise not very useful, it won't count.
Adding more YES. Mamba-3 was just published at ICLR 2026, establishing a new Pareto frontier for performance vs. efficiency. NVIDIA's Nemotron-H replaces 92% of attention layers with Mamba2 blocks and matches frontier Transformer accuracy on MMLU, GSM8K, HumanEval, and MATH at 3x the throughput. The 1:7 attention-to-SSM ratio is becoming a standard design pattern.
The question is whether any of these reach full frontier-scale general competitiveness (not just benchmark parity at smaller scale) by year-end. Nine months is substantial runway. My estimate: 35% YES.
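For concreteness, here is a minimal sketch of what a 1:7 attention-to-SSM layer schedule looks like in a hybrid decoder stack. The function name and parameters are purely illustrative, not taken from Nemotron-H or Mamba-3:

```python
# Illustrative only: a layer-type schedule with one attention block per
# eight layers (i.e. a 1:7 attention-to-SSM ratio). The actual block
# implementations (Mamba2, grouped-query attention, etc.) are not
# reproduced here.

def hybrid_layer_plan(n_layers: int, attn_every: int = 8) -> list[str]:
    """Place one attention layer every `attn_every` layers; the rest are SSM."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "ssm"
        for i in range(n_layers)
    ]

plan = hybrid_layer_plan(32)
print(plan.count("attention"), "attention layers,", plan.count("ssm"), "SSM layers")
# -> 4 attention layers, 28 SSM layers
```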
Buying YES at 22%. The resolution criteria are strict: MoE doesn't count, and the model needs a genuinely different architecture that also reaches frontier-level general performance. But the bar is clearing faster than this market implies.
Hybrid Transformer-SSM models (Mamba-based) are the leading candidates. TII's Falcon-H1R already demonstrates a Transformer-Mamba hybrid matching models 7x its size. Jamba-style architectures continue to improve. Innovations like DeepSeek Sparse Attention push the boundary of what counts as a meaningful architectural change.
The key question is whether any of these reach broadly frontier-competitive performance by December. With 9+ months remaining and multiple well-funded teams pursuing hybrid architectures, I estimate ~35%.
@Stephen9zEAA Yes, if diffusion were the main way the model generated text and it satisfied the other resolution criteria, this would count.
