By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" was stored within GPT-J-6B, in the sense that you could poke GPT-J there and make it believe the Eiffel Tower was in Rome.

"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can, for example, answer how to get to the Eiffel Tower from Berlin, believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions; we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented as in(Eiffel-Tower, Rome).
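For contrast, here is a minimal sketch (all names hypothetical, and the "edit" only loosely analogous to the paper's actual weight-editing procedure) of the brute symbolic representation a GOFAI system might have used:

```python
# A toy GOFAI-style fact base: the kind of explicit, symbolic
# representation available long before deep learning.
facts = {("in", "Eiffel-Tower", "France"), ("in", "Paris", "France")}

def located_in(entity, place):
    """Check a brute location assertion, in(entity, place)."""
    return ("in", entity, place) in facts

# In this toy world, the analogue of the paper's edit is just
# replacing one triple with another:
facts.discard(("in", "Eiffel-Tower", "France"))
facts.add(("in", "Eiffel-Tower", "Rome"))

print(located_in("Eiffel-Tower", "Rome"))    # True
print(located_in("Eiffel-Tower", "France"))  # False
```

The point of the contrast: the symbolic fact is fully transparent, while everything GPT-J does *with* the fact (like giving directions) remains opaque.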


This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, or cognitive representation within a text transformer / large language model (LLM) whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (to pick an arbitrary threshold for "before the rise of deep learning").


Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in eg video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
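A toy sketch of that coordinate-based idiom, of the sort found in pre-2006 video games (the bounding box and coordinates below are crude illustrations, not a serious geographic model):

```python
# France as a rough bounding box in (longitude, latitude) degrees;
# the Eiffel Tower as a point. "In France" becomes a containment test.
FRANCE_BBOX = (-5.0, 42.0, 8.0, 51.0)   # (min_lon, min_lat, max_lon, max_lat)
EIFFEL_TOWER = (2.2945, 48.8584)        # approximate (lon, lat)

def inside(point, bbox):
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

print(inside(EIFFEL_TOWER, FRANCE_BBOX))  # True
```

Finding an emergent structure like this inside a transformer would be a triumph of interpretability, but the representational *idea* itself was well known by 2006.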


2006 similarly knew about linear regression, k-nearest-neighbor, principal components analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).
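For concreteness, the kind of pre-2006 "statistical learning" pattern meant here - a minimal k-nearest-neighbor classifier (toy data invented for illustration):

```python
# Minimal k-nearest-neighbor: classify a query point by majority vote
# among the k closest training points. Discovering an emergent circuit
# like this inside an LLM would not count for this market.
import math

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points nearest to query."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # "a"
print(knn_predict(train, (5.5, 5.5)))  # "b"
```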


GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by the end of 2026 we will not be able to grasp any inkling of the cognition inside LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome), on their way to completing larger and more impressive tasks incorporating other cognitive steps - steps whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.

Andreas Stuhlmüller predicts YES

Another step: How do Language Models Bind Entities in Context?

Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.

Jörn predicts NO

@stuhlmueller afaict from the introduction, "binding" is something GOFAI already did, e.g. via expressions like "lives(Alice, Paris)". So this is more a step towards finding all of GOFAI again (which might also help find novel insights!) than a direct step towards discovering novel cognitive algorithms.

Victor Levoso predicts YES

So after working on interpretability for some time, I've updated in favor of my impression that interpretability progress is going to be very fast in the next few years.

I think Eliezer will be predictably surprised one or two more times until he updates and starts expecting faster progress.

The thing is, it turns out mechanistic interpretability is not extremely hard (it is hard in an absolute sense, but just normal-for-a-typical-field-of-science hard), and the lack of progress until now is mostly due to the fact that, except for a few people like Chris Olah and his team, it was basically not tried for years until recently, and a lot of the simplest ideas, like using autoencoders to detect features, just hadn't been tried and tend to just work.

But fortunately this is changing and seems likely to change further: ML academics are likely to do more work in the field now that Anthropic, DeepMind, and things like the ROME paper are making it trendy, and Neel Nanda has single-handedly gotten lots of people interested in the field.

It also seems like some unknowns about things like how models represent data might just resolve in the most favorable way, where it's mostly linear representations and features as directions.

(at least for transformers)
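A toy illustration of what "features as directions" means, under the linear-representation hypothesis mentioned above (all vectors here are invented for illustration; real model activations have thousands of dimensions):

```python
# If representations are linear, a "feature" is just a direction in
# activation space, and its strength is read out with a dot product.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

truth_direction = [1.0, 0.0, -1.0]           # hypothetical feature direction

def feature_strength(activation):
    return dot(activation, truth_direction)  # linear "readout"

honest_activation = [0.9, 0.2, -0.8]
lying_activation  = [-0.7, 0.1, 0.9]

print(feature_strength(honest_activation))  # positive: feature present
print(feature_strength(lying_activation))   # negative: feature absent
```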

Now, unfortunately, I think this doesn't necessarily mean we will have good enough interpretability to know what we are doing in time, since that's a high bar, but understanding any specific subtask that couldn't be done with GOFAI seems like an easier subproblem that is likely to be solved anyway.

I do get why it's a harder problem than understanding some already-known algorithm, but new complicated stuff is made of smaller subcomponents we can understand, and it seems likely we'll have useful advances in automated interpretability during 2024-2025.

IC Rainbow predicts NO

@VictorLevoso sure, but would they find something interesting and insightful to make this market resolve Yes?

Victor Levoso predicts YES

@ICRainbow I think that depends partly on what counts for the market, and on unknowns about how LLMs actually work.

But models seem able to do stuff like explaining jokes, which couldn't easily be done with GOFAI - relatively narrow tasks one could focus on, spending some time identifying the parts of the model involved, understanding it piece by piece, and getting a lot of feedback, leading to people figuring out how the whole thing works.

Like, dunno, it just seems like the kind of thing I expect to see in, for example, an Anthropic paper by 2026.

Not with more than the current probability of the market, but I'm mostly explaining why I've kept buying YES all this time and been the top YES bettor here, since my reasons have changed over time compared to my old comments, and some other YES comments seem like they are updating for the wrong reasons.

Victor Levoso predicts YES

@VictorLevoso Also, if we got to the point where we completely understand LLMs and how they do all the things GOFAI couldn't do, and the market still wouldn't resolve YES, then I feel like something was wrong with the market resolution criterion with respect to the spirit of the question, and Eliezer's notion of patterns of cognition not known in 2006 is probably just not coherent.

I feel like, to the extent the question makes sense at all, there is something in there that counts as novel enough.

Eli Tyre predicts NO

@Abraxas How so? This doesn't seem very relevant to me.

AgenticLondoner

So: math is really big. The set of "possible GOFAI-like problem-solving techniques" is big, and the set of "possible Statistical Learning-like problem-solving techniques" is big.

The Question decomposes into parts:

  1. Do structures implementing novel techniques in studied LLMs exist?

  2. Will Mechanistic Interpretability researchers successfully notice confusion when faced with a novel technique? Weak linguistic relativity - the weak form of the Sapir-Whorf hypothesis - might create a sufficient perception hurdle to inhibit discovery altogether. Will they just dismiss useful but alien structures as noise or "Junk DNA"?

  3. Even if novel technique structures are extant, studied, and readily perceivable, will they be perceived/written-up/reviewed/published before 31 Dec 2026?

Interpretability is that which we can interpret. I think #1 is possible, but I don't think 1, 2, and 3 are all true.

AgenticLondoner

A Schelling point to adjudicate "whose semantics would be unfamiliar" is to use the approach taken by patent law.

Martin Randall predicts NO

@AgenticLondoner Unfortunately, the patents for "one-click" and such have cost that coordination point much of its credibility.

ML predicts YES

It is the "whose semantics would be unfamiliar" that seems hard to resolve.

I recall back in the Overcoming Bias days a comment against one of Eliezer's own sequence posts to the effect that Einstein's work on general relativity was overrated and not that complex really because it was just a couple of tensors, using an extremely compact notation that was not how Einstein originally represented it (even though it looks like tensor calculus itself dates to 1890). Edit: As the previous can read ambiguously, let me provide a link and clarify that it was the commenter, not Eliezer, who wrote that Einstein's work was actually extremely simple.

I see Neel Nanda's work reverse-engineering the apparently-novel "Fourier multiplication" algorithm (invented by a small transformer trained to grok addition mod 113) has already been discussed below, and while I agree it doesn't seem to meet the resolution criteria, it seems illustrative of how one could always break this down and say it wasn't really new because it was just a DFT plus some glue, or whatever.
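For readers unfamiliar with that result, here is a simplified sketch of the trig-identity trick at its core (the real network learns a handful of particular key frequencies through training; the frequencies below are an arbitrary choice for illustration):

```python
# "Fourier multiplication" for addition mod P: embed each residue as
# cos/sin waves, turn addition of angles into multiplication of
# embeddings via trig identities, and pick the answer c whose wave
# best matches. cos(w(a+b-c)) is maximal exactly when c == (a+b) % P.
import math

P = 113
FREQS = [1, 2, 3]  # arbitrary frequencies for illustration

def add_mod_p(a, b):
    logits = []
    for c in range(P):
        score = 0.0
        for k in FREQS:
            w = 2 * math.pi * k / P
            # cos/sin of w(a+b), computed from the separate embeddings
            # of a and b (this is the "multiplication" step):
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # score is cos(w(a+b-c)), summed over frequencies:
            score += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(score)
    return max(range(P), key=lambda c: logits[c])

print(add_mod_p(100, 50))  # 37, i.e. (100 + 50) % 113
```

Each piece (DFT, trig identities, argmax) was known long before 2006; the open question is whether the *composition* the network invented counts as novel.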

Would Eliezer say more about how he'd adjudicate "semantics would be unfamiliar" when resolving?

The King

If we find another LLM inside a LLM, would that count?

Martin Randall bought Ṁ100 of NO

Hypothesis: LLMs do better than GOFAI by having better data, not by having better algorithms.

cat from /dev/null predicts NO

@MartinRandall or by using more compute/data. It seems to me that it's less than 50% likely that any substantial new patterns of cognition exist at all inside LLMs, let alone that they'll be discovered by EOY 2026.

Victor Levoso bought Ṁ100 of YES

@MartinRandall I think you should think of that as LLMs doing better than GOFAI by having more algorithms.

Because it's not like the LLM stores the data anywhere; if LLMs can do the things they do, it's because they have learned some useful circuits from the data.

So I guess the hypothesis could be that there's nothing new in there, and it's all the kind of thing we could do in GOFAI, but there's a lot more of it.

But the thing is, even if that's the case, it seems plausible that there's some structure that, although made of elements that were known in GOFAI, combines them in a novel way that Eliezer counts.

Also, it feels like there's something wrong with the question resolution if it leads to a situation where we understand LLMs but it doesn't resolve YES.

Yoav Tzfati bought Ṁ300 of YES

I think this is a massive update upward - from what I understand, they have identified certain activations that correspond to "truthfulness". Maybe this isn't full "transparency", but perhaps this is the start of a line of research that leads to a YES resolution? "Truthfulness" isn't something we could represent in 2006, right?

Vadim predicts YES

@YoavTzfati surely people were using truthfulness back then? like, as a simple boolean thing?

Yoav Tzfati predicts YES

@wadimiusz I think that representing the truth of a fact with a boolean is distinct from deciding whether to tell the truth or lie. I think this paper deals with the latter

Vadim predicts YES

@YoavTzfati like, this innovation is about "apparently LLMs can sort of distinguish truthful facts from other stuff, and we can push its internal button to produce the truth specifically", so this is cool in terms of controllability, but noticing that LLMs store this distinction is not a big update, right? it doesn't explain why LLMs are strong, it's not like "here's this nontrivial model that LLMs have under the hood, that we totally failed to invent ourselves, and that's what makes them so cool" (which I think is the intent of this market). cuz we already had the notion of truthful statements and they didn't make our models so strong before DL came along.
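A toy sketch of the "push its internal button" idea under discussion - computing a steering direction as a difference of mean activations and adding it to the model's internal state (all vectors here are invented; real papers in this area work with high-dimensional residual-stream activations):

```python
# Difference-of-means steering: the "truth direction" is the gap
# between average activations on truthful vs. untruthful examples.
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

truthful_acts   = [[1.0, 0.2], [0.8, 0.0], [1.2, 0.1]]
untruthful_acts = [[-0.9, 0.3], [-1.1, 0.0], [-1.0, 0.2]]

steer = [t - u for t, u in zip(mean(truthful_acts), mean(untruthful_acts))]

def steer_activation(act, alpha=1.0):
    """Add the steering direction, scaled by alpha, to an activation."""
    return [a + alpha * s for a, s in zip(act, steer)]

act = [-0.5, 0.1]               # an "untruthful-looking" activation
print(steer_activation(act))    # first coordinate pushed positive
```

As the comment notes, this gives controllability over a known concept, not transparency into a novel one.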

Yoav Tzfati predicts YES

@wadimiusz Hmm... maybe I misinterpreted the market then? @EliezerYudkowsky do you think this paper is relevant (as in research that builds on this will lead to a YES resolution)?

NLeseul predicts NO

@YoavTzfati Just based on the discussion here so far, this sounds like pretty much the same thing as the Eiffel Tower example. It suggests that truth as a property of information is represented somewhere in the network (and that you can manipulate the network to exploit that property), but doesn't say anything yet about how that representation happens.

If this is accurate, it sounds pretty promising for the possibility of alignment to me, but it isn't really providing transparency into the algorithms or data structures involved yet (let alone demonstrating their novelty).

Eliezer Yudkowsky predicts NO

@YoavTzfati I agree with @NLeseul.