As more large-number arithmetic examples are added to training sets, and as the quality and diversity of those sets improve over time, it's likely that by 2025 most of these language models will generalize from past learning well enough to handle larger-number computational tasks.
New LLM modelling approaches also make it possible to create models even more advanced than those we know today.
Advances in model architecture must be considered too, such as the emergence of various kinds of hybrid models, e.g. double-headed models (models in which the outputs of two or more models are merged) that combine traditional, human-like reasoning with advanced machine learning techniques. Technically it's YES, since modern models are trained on large datasets, but we can't know whether large-number arithmetic was included.
The reasoning behind answering "yes" is that language models, especially those trained on large, diverse datasets, tend to generalize well to tasks they weren't explicitly trained for, particularly when those tasks are closely related to their training data.
Even if a model isn't directly trained on large-number arithmetic, the combination of pattern recognition, generalization ability, better training data, and advances in architecture lets models handle larger-number operations with a high level of accuracy.
[deleted by author]
Doesn't this NeurIPS paper from a whole year ago just do this already? https://dl.acm.org/doi/10.5555/3737916.3741346
@spiderduckpig that's like a custom-made architectural change that does the relevant part of the generalization for the LM (iiuc); doesn't seem to match the spirit of the market to me, but maybe the creator disagrees, @vluzko
I think we agree that a language model using the abacus encoding is clearly still a language model, so this is just about the spirit of the market, and by the text of the market it's a yes resolution.
I disagree that the abacus encoding is "doing" the generalization. Just because an architectural change is helpful for a task doesn't mean it's invalid, or that it's some cheat or party trick you stick onto the transformer that autosolves the task. The transformer's MLPs are still doing the generalization; all the abacus encoding does is index the input data in a different way. Positional encodings have always been something you can tweak in a transformer, and nowhere does the market say we have to use an absolute-encoding LM or a FIRE-encoding LM or the like to get the result; that would be artificially tightening the standard.
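For reference, here's a stripped-down sketch of what abacus-style encoding does, as I understand it (my own simplification, not the paper's code; the real method also reverses digit order and randomizes the starting offset during training):

```python
import torch
import torch.nn as nn

# Toy sketch: each digit token is embedded by its position *within its
# number* rather than by its absolute position in the sequence, so the
# same "ones/tens/hundreds" structure transfers to longer numbers.
class AbacusEmbedding(nn.Module):
    def __init__(self, max_digits: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(max_digits, dim)  # max_digits > longest number seen

    def forward(self, is_digit: torch.Tensor) -> torch.Tensor:
        # is_digit: (seq_len,) bool mask marking which tokens are digits.
        idx = torch.zeros(is_digit.shape[0], dtype=torch.long)
        run = 0
        for i, d in enumerate(is_digit.tolist()):
            run = run + 1 if d else 0  # index resets at every non-digit token
            idx[i] = max(run - 1, 0)
        # Non-digit positions get a zero vector instead of a digit embedding.
        return self.emb(idx) * is_digit.unsqueeze(-1).float()
```

The point is that the generalization still has to happen in the trained weights; the encoding just exposes digit position as a feature.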
Nowhere does the market forbid using a new architecture, or even a task-specific one. In fact, I'd think the market creator assumed an architectural innovation might be needed for a yes. After all, it says "any" language model, implying an intentionally broad scope for qualifying LMs.
That being said, there are some papers achieving length generalization with FIRE encodings, which are completely digit-index-agnostic. https://arxiv.org/abs/2402.09371
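Roughly, FIRE swaps fixed positional indices for a learned function of normalized relative distance, something like this (my paraphrase of the idea, not the paper's code; hidden size and interpolation threshold are placeholder values):

```python
import torch
import torch.nn as nn

# Rough sketch of a FIRE-style attention bias: a small MLP applied to a
# log-transformed, progressively interpolated relative distance. No
# absolute digit index appears anywhere, which is what helps length
# generalization.
class FIREBias(nn.Module):
    def __init__(self, hidden: int = 32, num_heads: int = 8, L: float = 512.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_heads)
        )
        self.L = L  # threshold for progressive interpolation

    def forward(self, seq_len: int) -> torch.Tensor:
        i = torch.arange(seq_len).float().unsqueeze(1)  # query positions
        j = torch.arange(seq_len).float().unsqueeze(0)  # key positions
        rel = (i - j).clamp(min=0)                      # causal relative distance
        # Normalize by log of max(query position, L) so short and long
        # contexts land in the same input range for the MLP.
        denom = torch.log1p(torch.maximum(i, torch.full_like(i, self.L)))
        x = torch.log1p(rel) / denom
        # (seq, seq, 1) -> (seq, seq, heads) -> (heads, seq, seq) bias.
        return self.mlp(x.unsqueeze(-1)).permute(2, 0, 1)
```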
Fairly sure ChatGPT 5.2 Thinking Extended can do this now, simply because they gave it more time to do chain of thought for longer workflows like Excel.
@spiderduckpig Similar market resolved yes https://manifold.markets/gpt4/will-open-ai-release-a-model-that-c?r=c3BpZGVyZHVja3BpZw
@spiderduckpig yeah? it has only seen a sparse subset of all sentences, but it can write sentences? that is not what the market is asking
I am presuming you wouldn't accept language models trained, fine-tuned, or prompted to work with post-processors (such as by emitting Python expressions to be evaluated and replaced in the output before further continuations are generated), since those already exist today, but what about other types of hybrid systems?
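For concreteness, a toy version of the kind of post-processor I mean (the <calc> tag and eval approach are invented for illustration):

```python
import re

# The LM emits spans like <calc>123456789 * 987654321</calc>; a wrapper
# evaluates them before generation continues. Restricting the character
# set keeps eval() to plain arithmetic expressions.
CALC = re.compile(r"<calc>([0-9+\-*/() ]+)</calc>")

def postprocess(text: str) -> str:
    return CALC.sub(lambda m: str(eval(m.group(1))), text)

print(postprocess("The product is <calc>123456789 * 987654321</calc>."))
# -> "The product is 121932631112635269."
```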
For example, if something similar to Memorizing Transformers was used, except instead of memorized past context the system injected into an intermediate layer what it predicted to be the most salient numeric computation results based on the current context, would that still count as a language model for purposes of resolution?
Or is your intent to explore the ability of pure LLMs to generalize, and so you would consider something like the above a cheat?
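To make the hypothetical concrete, I'm imagining something like the following, where a side module encodes externally computed arithmetic results and adds them into an intermediate layer's hidden states (all names invented for the sketch):

```python
import torch
import torch.nn as nn

# Hypothetical analogue of Memorizing Transformers: instead of retrieved
# past context, an encoding of predicted-salient numeric results is
# injected into an intermediate layer through a learned gate.
class NumericInjection(nn.Module):
    def __init__(self, d_model: int, result_dim: int):
        super().__init__()
        self.proj = nn.Linear(result_dim, d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # learned scalar mixing gate

    def forward(self, hidden: torch.Tensor, result_vec: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) hidden states at some layer
        # result_vec: (batch, result_dim) encoding of computed results
        injected = self.proj(result_vec).unsqueeze(1)  # broadcast over seq
        return hidden + torch.sigmoid(self.gate) * injected
```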
@ML Moreover, would "add these numbers by extending the rule you have learned up to 5 digits" count? Or would only an "add these, one shot, no additional instructions, GO" type of prompt be allowed?