Small number arithmetic in the training set is fine, as is non-arithmetic. "Small" and "large" are relative: if the training set contains arithmetic up to 20 digits and it generalizes to 100 digits, the question resolves yes. I'll accept a subset of arithmetic as well, e.g. if it can only do large number addition but not multiplication the question resolves yes.
Update 2026-01-06 (PST) (AI summary of creator comment): Thinking models (models that use chain-of-thought or extended reasoning) can satisfy the resolution criteria. The creator tested Claude Opus with 100-digit arithmetic and it succeeded, which they consider sufficient evidence that it wasn't extensively trained on large number arithmetic (as such training would be economically wasteful).
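A spot-check like the creator's can be reproduced mechanically: generate random 100-digit operands, compute the exact answer with Python's arbitrary-precision integers, and see whether the model's output contains that digit string. This is a minimal sketch; `call_your_model` is a hypothetical stand-in for whatever API you'd actually query.

```python
import random

def make_addition_problem(n_digits, rng=random.Random(0)):
    """Generate an n-digit addition problem and its exact answer."""
    a = rng.randrange(10**(n_digits - 1), 10**n_digits)
    b = rng.randrange(10**(n_digits - 1), 10**n_digits)
    return f"What is {a} + {b}?", str(a + b)

def check_answer(model_output, expected):
    """Accept if the exact digit string appears (commas/spaces stripped)."""
    return expected in model_output.replace(",", "").replace(" ", "")

prompt, expected = make_addition_problem(100)
# model_output = call_your_model(prompt)  # hypothetical API call
# passed = check_answer(model_output, expected)
```

Per the resolution criteria, running this at one operation (e.g. addition only) would suffice, since a subset of arithmetic counts.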
Top traders
| # | Trader | Total profit |
|---|---|---|
| 1 | | Ṁ4,548 |
| 2 | | Ṁ731 |
| 3 | | Ṁ315 |
| 4 | | Ṁ207 |
| 5 | | Ṁ202 |
Interesting bet.
Are you looking for an effect like this? It seems plausible.
Here's an AI model that learned modular arithmetic:
https://youtu.be/D8GOeCFFby4?si=PAy5Ydji00jJRCEl
But do you expect this will be tested and published?
As more large-number arithmetic examples are added to training sets, and as the quality and diversity of training data improve over time, it's likely that most of these language models will develop a more adequate ability to generalize from past learning to larger-number computational tasks in 2025.
Technically it's YES, since modern models are trained on large datasets, but we can't know whether large-number arithmetic was included.
The reasoning behind answering "yes" is that language models trained on large, diverse datasets tend to generalize well to tasks they weren't explicitly trained on, particularly when those tasks are closely related to their training data.
Even if a model isn't directly trained on large-number arithmetic, the combination of pattern recognition, generalization ability, better training data, and architectural advances lets models handle larger-number operations with a high level of accuracy.
[deleted by author]
Doesn't this NeurIPS paper from a whole year ago already do this? https://dl.acm.org/doi/10.5555/3737916.3741346
@spiderduckpig that's a custom-made architectural change that does the relevant part of the generalization for the LM (iiuc); it doesn't seem to match the spirit of the market to me, but maybe the creator disagrees, @vluzko
I think we agree that a language model using the abacus encoding is clearly still a language model, so this is just about the spirit of the market, and by the text of the market it's a yes resolution.
I disagree that the abacus encoding is "doing" the generalization. Just because an architectural change is helpful for a task doesn't mean it's invalid, or that it's some cheat or party trick you stick onto the transformer that autosolves the task. The transformer's MLPs are still doing the generalization; all the abacus encoding does is index the input data in a different way. Positional encodings have always been something you can tweak in a transformer, and nowhere does the market say we have to use an absolute-encoding LM or a FIRE-encoding LM to get the result; that would just artificially tighten the standard.
Nowhere does the market forbid using a new architecture or even a task-specific architecture. In fact, I'd think the market creator assumed an architectural innovation might be needed for a yes. After all, it says "any" language model, implying an intentionally broad scope for qualifying LMs.
That being said, there are some papers achieving length generalization with FIRE encodings, which are completely digit index-agnostic. https://arxiv.org/abs/2402.09371
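To make the "indexing the input data in a different way" point concrete, here is a toy sketch of the abacus-style idea: each digit gets a position counted within its own number, so the same index recurs for corresponding digit places no matter where the number sits in the sequence. This is an illustrative simplification, not the paper's exact implementation (which folds these indices into learned embeddings with training-time offsets).

```python
def abacus_positions(tokens):
    """Toy abacus-style indexing: digits are numbered 1, 2, 3, ...
    within their own number; non-digit tokens reset the counter
    and get position 0."""
    positions, k = [], 0
    for tok in tokens:
        if tok.isdigit():
            k += 1
            positions.append(k)
        else:
            k = 0
            positions.append(0)
    return positions

# Each number's digits are indexed from 1, independent of absolute position.
print(abacus_positions(list("12+345")))  # [1, 2, 0, 1, 2, 3]
```

Because corresponding digit places share an index regardless of operand length or location, the learned digit-manipulation circuits can transfer to operand lengths never seen in training, which is the length-generalization effect at issue here.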
@spiderduckpig This paper does not resolve the market yes but not because of the custom positional embedding (that part is fine, I don't understand why anyone thought it wouldn't be, minor modifications to architectures are... just normal ML? I didn't even specify a transformer let alone a specific positional embedding). The problem is that they didn't train a language model: they trained a set of decoder-only transformers on specific narrow tasks. It's plausible that if you did train or finetune a real language model based on this paper that it would work, but they didn't do that.
Fairly sure ChatGPT 5.2 Thinking Extended can do this now, simply because it's been given more time for chain of thought on longer workflows like Excel.
@spiderduckpig Similar market resolved yes https://manifold.markets/gpt4/will-open-ai-release-a-model-that-c?r=c3BpZGVyZHVja3BpZw
@spiderduckpig yeah? It has only seen a sparse subset of all sentences, but it can write sentences? That is not what the market is asking.