Will any author of "Clever Hans or Neural Theory of Mind?" say that LLMs have some robust Theory-of-Mind (ToM) in 2026?

The paper "Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models" (https://arxiv.org/abs/2305.14763) examines claims of emergent "Theory of Mind" (ToM) reasoning in LLMs. The authors find that slightly modified versions of standard ToM benchmarks are too hard for current models, even GPT-4. These variants are often "adversarial" to humans too, but I consider the examples valid: with a bit of focus, I can easily work out the correct answer to some of the questions at https://github.com/salavi/Clever_Hans_or_N-ToM.

The results in the paper seem reasonable right now; this question is about whether the implications of that paper hold throughout 2026. A major concern is models being trained on the test set or on very in-distribution questions; hence, I do not want to tie the resolution to any particular benchmark that exists now, including the paper's newly introduced tests.

Instead, in December 2026, I will email all the authors of the paper, asking whether there is evidence that any publicly known (but not necessarily publicly available) AI system shows signs of robust ToM, in the sense of robustly outperforming at least 20% of people in the world, or in a representative country.

The system can be an LLM or a similar oracle enhanced in any way; for example, prompting it with "You are a model exhibiting robust ToM and social reasoning. Question: ..." is fine for the purposes of the question, as are systems that query a model multiple times to make it more robust. However, there should be no person in the loop during tests.

Resolves YES if:

- any author confirms that several of the implications they raise in the 2023 paper do not apply to some AI systems, at any point until 31 Dec 2026;
- or the reported results are changed or shown to be untrustworthy in a way that points to a YES resolution (e.g. there was a bug or strawmanning issue in the testing that significantly underestimated GPT-4 performance);

Resolves NO if:

- none of the above happens, and at least one author replies in a way consistent with a NO resolution;

Question resolves N/A if:
- none of the authors reply in any way;

- I am not able to ask the authors myself, and no researcher I trust with resolving this question can do that for me;
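
As a rough illustration of how the three outcomes above fit together, here is a minimal Python sketch. The field names and the order in which the conditions are checked are my own assumptions, not part of the official criteria. In particular, it checks the YES conditions first (since a correction to the paper's results does not require an author reply), and it returns `None` for any case the criteria above do not explicitly cover.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Resolution(Enum):
    YES = "YES"
    NO = "NO"
    NA = "N/A"


@dataclass
class Evidence:
    # All field names are hypothetical labels for the conditions listed above.
    authors_could_be_asked: bool        # I, or a researcher I trust, could contact them
    any_author_replied: bool            # at least one author replied in any way
    author_confirms_not_applicable: bool  # an author says several implications no longer apply
    results_revised_toward_yes: bool    # e.g. a bug that underestimated GPT-4 performance
    reply_consistent_with_no: bool      # some reply is consistent with a NO resolution


def resolve(e: Evidence) -> Optional[Resolution]:
    # Assumption: YES conditions take precedence over everything else.
    if e.author_confirms_not_applicable or e.results_revised_toward_yes:
        return Resolution.YES
    # N/A: the authors could not be asked, or nobody replied.
    if not e.authors_could_be_asked or not e.any_author_replied:
        return Resolution.NA
    # NO: none of the YES conditions hold and some reply is consistent with NO.
    if e.reply_consistent_with_no:
        return Resolution.NO
    # Not covered by the stated criteria (e.g. only ambiguous replies).
    return None
```

For example, `resolve(Evidence(True, True, False, False, True))` gives `Resolution.NO`, while the same call with `any_author_replied=False` gives `Resolution.NA`.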

Note: I believe the benchmark results in the paper are correctly reported. Based on their social media profiles and general reputations, I believe there is a good chance some of the authors will give honest answers in 2026.

