MANIFOLD
Will any language model trained without large number arithmetic be able to generalize to large number arithmetic by 2026?
98%
Resolved YES (Jan 6)

Small number arithmetic in the training set is fine, as is non-arithmetic. "Small" and "large" are relative: if the training set contains arithmetic up to 20 digits and it generalizes to 100 digits, the question resolves yes. I'll accept a subset of arithmetic as well, e.g. if it can only do large number addition but not multiplication the question resolves yes.
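
For concreteness, a minimal sketch of the kind of train/eval split that example describes, reusing the 20-digit and 100-digit figures from above (addition only, purely for illustration):

```python
import random

def make_problem(n_digits: int):
    # One random n-digit addition problem rendered as plain text.
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a} + {b} =", str(a + b)

# "Small" arithmetic allowed in training: up to 20 digits.
train_set = [make_problem(random.randint(1, 20)) for _ in range(100_000)]

# "Large" arithmetic held out entirely: 100-digit problems for evaluation.
eval_set = [make_problem(100) for _ in range(1_000)]
```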

  • Update 2026-01-06 (PST) (AI summary of creator comment): Thinking models (models that use chain-of-thought or extended reasoning) can satisfy the resolution criteria. The creator tested Claude Opus with 100-digit arithmetic and it succeeded, which they consider sufficient evidence that it wasn't extensively trained on large number arithmetic (as such training would be economically wasteful).


๐Ÿ… Top traders

Rank  Total profit
1     แน€4,548
2     แน€731
3     แน€315
4     แน€207
5     แน€202

@Bayesian I trusted you

Thinking models can do this - I tested Opus with 100 digit numbers and it had no issues. Of course I don't have confirmation that Claude isn't extensively trained on 100 digit arithmetic, but it would be a massive waste of money so it seems rather unlikely.
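
A spot-check along those lines is easy to script; `ask_model` below is a stand-in for whichever chat API you're testing (not a real client), and the check uses exact integer arithmetic as ground truth:

```python
import random
import re

def ask_model(prompt: str) -> str:
    # Placeholder: wire this up to Claude, GPT, a local model, etc.
    raise NotImplementedError

def check_100_digit_addition(n_trials: int = 20) -> float:
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10**99, 10**100 - 1)
        b = random.randint(10**99, 10**100 - 1)
        reply = ask_model(f"Compute {a} + {b}. Reply with only the number.")
        digits = re.sub(r"[^0-9]", "", reply)  # drop commas, spaces, etc.
        correct += digits == str(a + b)
    return correct / n_trials
```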

Clear yes, we are in 2026

opened a แน€50 YES at 40% order

@Merchant @johnNZOy another stupid clanker note please

Interesting bet.

Are you looking for an effect like this? It seems plausible.

Here's an AI model that learned modular arithmetic:

https://youtu.be/D8GOeCFFby4?si=PAy5Ydji00jJRCEl

But do you expect this will be tested and published?

As more large-number arithmetic examples are added to training sets, and as the quality and diversity of those sets improve over time, it's likely that most of these language models will develop a more adequate ability to generalize from past learning to larger-number computational tasks in 2025.

Technically it's YES, as modern models are trained on huge datasets, but we can't know whether large-number arithmetic was included

bought แน€150 YES

The reasoning behind answering "yes" is that language models, especially those trained on large, diverse datasets, tend to generalize well to tasks they weren't explicitly trained on, particularly when those tasks are closely related to their training data.

Even if a model doesn't directly train on large-number arithmetic, the combination of pattern recognition, generalization ability, better training data, and advances in architecture lets models handle larger-number operations with a high level of accuracy.

[deleted by author]

I love markets like this where people are just betting on what interpretation the creator will accept

bought แน€350 YES

Doesn't this NeurIPS paper from a whole year ago just do this already? https://dl.acm.org/doi/10.5555/3737916.3741346

@spiderduckpig that's like a custom-made architectural change that does the relevant part of the generalization for the LM (iiuc), so it doesn't seem to match the spirit of the market to me, but maybe the creator disagrees, @vluzko

@Bayesian

I think we agree that a language model using the abacus encoding is clearly still a language model, so this is just about the spirit of the market; by the text of the market, it's a yes resolution.

I disagree that the abacus encoding is "doing" the generalization. Just because an architectural change is helpful for a task doesn't mean it's invalid, or that it's some cheat or party trick you stick onto the transformer that just autosolves it. The transformer's MLPs are still doing the generalization; all the abacus encoding does is index the input data in a different way. Positional encodings have always been something you can tweak in a transformer, and nowhere does the market say we have to use an absolute-encoding LM or a FIRE-encoding LM or something to get the result; that's just artificially tightening the standard.
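
(For anyone who wants the gist in code, a from-memory PyTorch sketch of the Abacus idea: each digit token gets an extra learned embedding indexed by its position inside its own number, with a random shared offset at train time so positions longer than any training number still get gradient. The table size, offset range, and left-to-right indexing here are illustrative, not the paper's code.)

```python
import torch
import torch.nn as nn

def digit_position_ids(tokens, digit_vocab, max_offset=100, training=True):
    # Position of each token within its contiguous run of digits (0 for non-digits).
    ids, run = [], 0
    for t in tokens:
        run = run + 1 if t in digit_vocab else 0
        ids.append(run)
    # The random shared offset is what forces the table to cover long positions.
    offset = torch.randint(0, max_offset, (1,)).item() if training else 0
    return torch.tensor([i + offset if i > 0 else 0 for i in ids])

digit_vocab = set("0123456789")
pos_emb = nn.Embedding(512, 64)                      # added to the token embeddings
ids = digit_position_ids(list("12+345="), digit_vocab)
extra = pos_emb(ids)                                 # shape: [seq_len, 64]
```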

Nowhere does the market forbid using a new architecture or even a task-specific architecture. In fact I'd think an architectural innovation was assumed as a possibility by the market creator to be needed for a yes. After all, it says "any" language model, implying an intentionally broad scope for qualifying LMs.

That being said, there are some papers achieving length generalization with FIRE encodings, which are completely digit index-agnostic. https://arxiv.org/abs/2402.09371
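
(Also from memory, roughly what FIRE does: the attention bias is a small MLP applied to a log-normalized relative distance, so no absolute position or digit index appears anywhere, which is why it's digit-index-agnostic. The constants and layer sizes below are placeholders, not the paper's.)

```python
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    # Relative-position attention bias from a tiny MLP over log-scaled distance.
    def __init__(self, hidden=32, c=1.0, L=64.0):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.c, self.L = c, L

    def forward(self, seq_len):
        i = torch.arange(seq_len).unsqueeze(1).float()   # query positions
        j = torch.arange(seq_len).unsqueeze(0).float()   # key positions
        rel = (i - j).clamp(min=0)                       # causal distances
        psi = lambda x: torch.log(self.c * x + 1)
        norm = psi(torch.maximum(i, torch.full_like(i, self.L)))
        bias = self.mlp((psi(rel) / norm).unsqueeze(-1)).squeeze(-1)
        return bias                                      # added to attention logits

bias = FIREBias()(seq_len=16)                            # shape: [16, 16]
```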

@spiderduckpig This paper does not resolve the market yes but not because of the custom positional embedding (that part is fine, I don't understand why anyone thought it wouldn't be, minor modifications to architectures are... just normal ML? I didn't even specify a transformer let alone a specific positional embedding). The problem is that they didn't train a language model: they trained a set of decoder-only transformers on specific narrow tasks. It's plausible that if you did train or finetune a real language model based on this paper that it would work, but they didn't do that.

sold แน€350 NO

Fairly sure ChatGPT 5.2 Thinking Extended can do this now, simply because they gave it more time to do chain of thought for longer workflows like Excel

@spiderduckpig wut, no, GPT 5.2 has extensive training on large number arithmetic

@Bayesian Surely it would only have seen a sparse subset of all large digit number arithmetic?

bought แน€250 YES

@spiderduckpig yeah? it has only seen a sparse subset of all sentences, but it can write sentences? that is not what the market is asking

opened a แน€1,000 YES at 28% order

@Bayesian I will put a limit order

bought แน€150 NO

@spiderduckpig i'll bet at 50%

opened a แน€750 YES at 40% order
opened a แน€1,000 NO at 50% order

@spiderduckpig ig uh we provide liquidity for onlookers

bought แน€150 NO

The art of the deal

someone could test this with nanochat or similar

bought แน€500 NO

but nobody will
