I am presuming you wouldn't accept language models trained, fine-tuned, or prompted to work with post-processors (such as by emitting Python expressions to be evaluated and substituted into the output before further continuations are generated), since those already exist today. But what about other types of hybrid systems?
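To make the post-processor case concrete, here is a toy sketch (the `{{...}}` delimiter, the stub model, and the whole pipeline are hypothetical, just to illustrate the evaluate-and-substitute loop; a real system would call an actual LLM and use a restricted evaluator rather than `eval`):

```python
import re

def fake_model(prompt):
    # Stand-in for an LLM that emits a Python expression in {{...}}
    # whenever it needs arithmetic done exactly.
    return "The total is {{123456 * 789}} apples."

def postprocess(text):
    """Evaluate {{...}} expressions and splice the results back in
    before any further continuation is generated."""
    def evaluate(match):
        # eval() on model output is unsafe in practice; shown here
        # only for the sketch, with builtins stripped.
        return str(eval(match.group(1), {"__builtins__": {}}))
    return re.sub(r"\{\{(.+?)\}\}", evaluate, text)

print(postprocess(fake_model("How many apples?")))
# → The total is 97406784 apples.
```

The point is that the arithmetic never passes through the model's weights at all, which is presumably why such systems would not count.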
For example, if something similar to Memorizing Transformers was used, except instead of memorized past context the system injected into an intermediate layer what it predicted to be the most salient numeric computation results based on the current context, would that still count as a language model for purposes of resolution?
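A minimal sketch of what I mean, with NumPy standing in for the model (the layer function, the number encoding, and the injection point are all invented for illustration; only the overall shape — compute externally, add at an intermediate layer — mirrors the Memorizing Transformers idea):

```python
import numpy as np

d_model = 16

def layer(h):
    # Stand-in for a transformer layer: any fixed transformation.
    return np.tanh(h)

def embed_number(x, d=d_model):
    # Hypothetical encoding of a numeric result as a vector
    # (zero-padded digits, scaled to [0, 1]).
    digits = [int(c) for c in f"{int(x):020d}"]
    v = np.zeros(d)
    v[:d] = digits[:d]
    return v / 10.0

def forward(h, context_numbers):
    h = layer(h)                  # lower layers run normally
    result = sum(context_numbers) # external module computes what it
                                  # predicts to be the salient result
    h = h + embed_number(result)  # injected at an intermediate layer
    return layer(h)               # upper layers consume it

rng = np.random.default_rng(0)
out = forward(rng.standard_normal(d_model), [123456, 789])
print(out.shape)  # → (16,)
```

Here the exact arithmetic happens outside the network, but the result flows through the upper layers' weights rather than being pasted into the output text, so it is less obviously "not a language model" than the post-processor case.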
Or is your intent to explore the ability of pure LLMs to generalize, and so you would consider something like the above a cheat?
@ML Moreover, would ‘add these numbers by extending the rule you have learned up to 5 digits’ count? Or would only an ‘add these, one shot, no additional instructions, GO’ type of prompt be allowed?