By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" had been stored within GPT-J-6B, in the sense that you could poke the GPT there and make it believe the Eiffel Tower was in Rome.
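To make "poke the GPT there" concrete, here is a minimal sketch of a rank-one edit to a linear associative memory, the general kind of update such an edit relies on. It is a simplification under stated assumptions, not the paper's method: the published technique also weights the update by statistics of previously stored keys, which this sketch replaces with an identity matrix. The names `matvec` and `rank_one_edit` and the toy 2x2 memory are hypothetical.

```python
def matvec(W, k):
    """Multiply matrix W (a list of rows) by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def rank_one_edit(W, k, v_new):
    """Return W' such that W' k == v_new, changing W only along direction k.

    Assumption: identity key covariance, so the update is
    W' = W + (v_new - W k) k^T / (k^T k).
    """
    residual = [t - c for t, c in zip(v_new, matvec(W, k))]
    k_norm_sq = sum(x * x for x in k)
    return [
        [w + r * x / k_norm_sq for w, x in zip(row, k)]
        for row, r in zip(W, residual)
    ]

# Toy memory: key [1, 0] ("Eiffel Tower") currently maps to value [1, 0] ("France").
W = [[1.0, 0.0], [0.0, 1.0]]
W_edited = rank_one_edit(W, k=[1.0, 0.0], v_new=[0.0, 1.0])  # now maps to "Rome"
```

Keys orthogonal to the edited key keep their old values, which is the sense in which such an edit is local.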

"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can for example answer how to get to the Eiffel Tower from Berlin, believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions, we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented with in(Eiffel-Tower, Rome).
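As a minimal sketch of what a representation like in(Eiffel-Tower, Rome) amounts to, assuming facts are stored as plain (predicate, subject, object) triples: the helper names `holds` and `edit_fact` are hypothetical, chosen only for illustration.

```python
# Hypothetical GOFAI-style fact base: each fact is a (predicate, subject, object) triple.
facts = {("in", "Eiffel-Tower", "France")}

def holds(predicate, subject, obj):
    """True if the exact triple is stored in the fact base."""
    return (predicate, subject, obj) in facts

def edit_fact(predicate, subject, new_obj):
    """Replace whatever object the subject currently has for this predicate."""
    stale = {f for f in facts if f[0] == predicate and f[1] == subject}
    facts.difference_update(stale)
    facts.add((predicate, subject, new_obj))

# The symbolic analogue of the edit: the tower is now "in" Rome.
edit_fact("in", "Eiffel-Tower", "Rome")
```

Nothing in this store knows how to give directions from Berlin; it only holds the brute fact.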


This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, cognitive representation, within a text transformer / large language model (LLM), whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (to pick an arbitrary threshold for "before the rise of deep learning").


Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in eg video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
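A minimal sketch of that coordinate idiom, assuming a standard ray-casting point-in-polygon test. The bounding box for France is a crude illustrative rectangle, not a real border, and the landmark coordinates are approximate.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# (longitude, latitude) corners of a rough box around France. Illustrative only.
FRANCE_BOX = [(-5.0, 42.0), (8.0, 42.0), (8.0, 51.0), (-5.0, 51.0)]

EIFFEL_TOWER = (2.294, 48.858)  # approximate
ROME = (12.496, 41.903)         # approximate

tower_in_france = point_in_polygon(*EIFFEL_TOWER, FRANCE_BOX)
rome_in_france = point_in_polygon(*ROME, FRANCE_BOX)
```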


2006 similarly knew about linear regression, k-nearest-neighbor, principal component analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).
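For reference, the k-nearest-neighbor idiom fits in a few lines, which is part of why finding it inside an LLM would not count as new semantics. This is a minimal sketch with made-up toy data; `knn_predict` and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labelled "a" and "b".
train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
```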


GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by end of 2026 we will not be able to grasp any inkling of the cognition inside of LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside of LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome); on their way to completing larger and more impressive tasks, incorporating other cognitive steps; whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.

predicts NO

@AlexMizrahi This market is not about identifying semantics inside an LLM. It is not about identifying semantics inside an LLM which we did not previously know to be inside LLMs. This market is about identifying semantics inside an LLM such that those semantics, once uncovered, teach us something about semantic representations in general which we did not know in 2006.

predicts NO

@MartinRandall What matters isn't the opacity of the discovering program, but whether the discovered result is semantically transparent to us.

@RyanMoulton Presumably not. They are not exactly looking at semantics here...

bought Ṁ50 YES

I see this question as a sort of dual to the increase in capabilities that shocked the world with each new iteration of GPTs. With the progress of interpretability, it would be surprising to me if we weren't able to elucidate some of the deep reasons why they perform so well and, en passant, discover something new and strange about linguistics and forty other things.

bought Ṁ100 YES


Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.

bought Ṁ1 YES at 46%

How would you resolve this market if researchers distil causal models from an LLM that are much better than causal models constructed by older means? Would it make a difference if they had, for example, a modestly different notion of “intervention” to existing causal models?

My guesses are “no” and “maybe” in that order

@DavidJohnston I think if we learn a different notion or more compact representation for interventions off studying LLMs, that definitely counts. In the former case I think I want to know more about "better"; if we just distilled knowledge in a known format that LLMs learned by inscrutable means, we have not found and understood a new algorithm.

Semantics are hard. And I think as we poke and prod LxMs, we will learn much more about how to update our thinking about how the human brain functions in re: semantics and linguistics. But I think the time horizon is farther out than 2026, because I don't think there is enough interpretability between shape rotators and wordcels yet.

By the late 2020s, we might be able to decode something meaningful from current models (with the help of later models) - but the later models may still be out of reach.

sold Ṁ564 NO

@EliezerYudkowsky Selling all shares in this market to avoid any appearance of conflict of interest in judging it.

Re: "Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model"

predicts YES

Another step: How do Language Models Bind Entities in Context?

Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.

predicts NO

@stuhlmueller afaict from the introduction, "binding" is sth GOFAI already did, e.g. via expressions like "lives(Alice,Paris)". So this is more a step towards finding all of GOFAI again (which might also help find novel insights!) than a direct step towards discovering novel cognitive algorithms.

predicts YES

So after working on interpretability for some time, I've updated in favour of my impression that interpretability progress is going to be very fast in the next few years.

I think Eliezer will be predictably surprised one or two more times until he updates and starts expecting faster progress.

The thing is, it turns out mechanistic interpretability is not extremely hard (it is hard in an absolute sense, but just the normal kind of hard for a typical field of science), and the lack of progress until now is mostly because, except for a few people like Chris Olah and his team, it was basically not tried for years until recently; a lot of the simplest ideas, like using autoencoders to detect features, just hadn't been tried and tend to just work.

But fortunately this is changing and seems likely to change further: ML academics are likely to do more work in the field now that Anthropic, DeepMind, and things like the ROME paper are making it trendy, and Neel Nanda has single-handedly gotten lots of people interested in the field.

It also seems like some unknowns, such as how models represent data, might just resolve in the most favorable way, where it's mostly linear representations with features as directions.

(at least for transformers)

Now, unfortunately, I think this doesn't necessarily mean we will have good enough interpretability to know what we are doing in time, since that's a high bar, but understanding any specific subtasks that couldn't be done with GOFAI seems an easier subproblem that seems likely to be solved anyway.

I do get why it's a harder problem than understanding some already-known algorithm, but new complicated stuff is made of smaller subcomponents we can understand, and it seems likely we'll have useful advances in automated interpretability during 2024-2025.

predicts NO

@VictorLevoso sure, but would they find something interesting and insightful to make this market resolve Yes?

predicts YES

@ICRainbow I think that it depends partly on what counts for the market and on unknowns about how LLMs actually work.

But models seem to be able to do things like explaining jokes, which couldn't easily be done with GOFAI and which are relatively narrow tasks. One could focus on such a task and spend some time identifying the parts of the model involved, understanding it piece by piece and getting a lot of feedback, leading to people figuring out how the whole thing works.

Like, I dunno, it just seems like the kind of thing I expect to see in, for example, an Anthropic paper by 2026.

Not with more than the current probability of the market, but I'm mostly explaining why I've kept buying yes all this time and been the top yes bettor here, since my reasons have changed over time compared to my old comments, and some other yes comments seem like they are updating on the wrong reasons.

predicts YES

@VictorLevoso Also, if we got to the point where we completely understand LLMs and how they do all the things GOFAI couldn't do, and the market still wouldn't resolve YES, then I feel like something was wrong with the market resolution criterion with respect to the spirit of the question, and Eliezer's notion of patterns of cognition not known in 2006 is probably just not coherent.

I feel like to the extent the question makes sense at all there is something in there that counts as novel enough.

predicts NO

@Abraxas How so? This doesn't seem very relevant to me.

So Math is really big. The set of "possible GOFAI-like problem solving techniques" is big and the set of "possible Statistical Learning-like problem solving techniques" is big.

The Question decomposes into parts:

  1. Do structures implementing novel techniques in studied LLMs exist?

  2. Will Mechanistic Interpretability researchers successfully notice confusion when faced with a novel technique? Weak Linguistic Relativity and the weak-form of Sapir-Whorf hypothesis might create a sufficient perception hurdle to inhibit discovery altogether - Will they just dismiss useful but alien structures as noise or "Junk DNA"?

  3. Even if novel technique structures are extant, studied and readily perceivable, will they be perceived/written-up/reviewed/published before 31 Dec 2026?

Interpretability is that which we can interpret. I think #1 is possible, but I don't think 1, 2 and 3 are all True.


A schelling point to adjudicate "whose semantics would be unfamiliar" is to use the approach taken by patent law.

predicts NO

@AgenticLondoner Unfortunately the patents for "one-click" and such have lost much of the credibility of that coordination point.

predicts YES

It is the "whose semantics would be unfamiliar" that seems hard to resolve.

I recall back in the Overcoming Bias days a comment against one of Eliezer's own sequence posts to the effect that Einstein's work on general relativity was overrated and not that complex really because it was just a couple of tensors, using an extremely compact notation that was not how Einstein originally represented it (even though it looks like tensor calculus itself dates to 1890). Edit: As the previous can read ambiguously, let me provide a link and clarify that it was the commenter, not Eliezer, who wrote that Einstein's work was actually extremely simple.

I see Neel Nanda's work reverse-engineering the apparently-novel "Fourier multiplication" algorithm (invented by a small transformer trained to grok addition mod 113) has already been discussed below, and while I agree it doesn't seem to meet the resolution criteria, it seems illustrative of how one could always break this down and say it wasn't really new because it was just a DFT plus some glue, or whatever.
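To show what "a DFT plus some glue" could look like in practice, here is a minimal sketch of modular addition done with trig identities and an argmax. It illustrates the general idea only and is not a reconstruction of the trained network's weights. The frequencies in `FREQS` are an arbitrary assumption for this sketch, and the function name `fourier_add` is hypothetical.

```python
import math

P = 113            # modulus from the discussion above
FREQS = (1, 2, 5)  # hypothetical frequencies, chosen arbitrarily for this sketch

def fourier_add(a, b, p=P, freqs=FREQS):
    """Return (a + b) % p using only trig identities and an argmax."""
    scores = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        ca, sa = math.cos(w * a), math.sin(w * a)
        cb, sb = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos/sin of w*(a+b) without forming a+b.
        c_sum = ca * cb - sa * sb
        s_sum = sa * cb + ca * sb
        for c in range(p):
            # This term equals cos(w*(a+b-c)), which reaches 1 only when
            # c == (a+b) mod p, because p is prime and k is nonzero mod p.
            scores[c] += c_sum * math.cos(w * c) + s_sum * math.sin(w * c)
    # The candidate where every frequency agrees has the highest total score.
    return max(range(p), key=scores.__getitem__)
```

Each ingredient here (sines, cosines, angle addition, argmax) long predates 2006, which is the adjudication difficulty the comment is pointing at.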

Would Eliezer say more about how he'd adjudicate "semantics would be unfamiliar" when resolving?

If we find another LLM inside an LLM, would that count?

