By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" had been stored within GPT-J-6B, in the sense that you could poke the GPT there and make it believe the Eiffel Tower was in Rome.

"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can for example answer how to get to the Eiffel Tower from Berlin, believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions, we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented with in(Eiffel-Tower, Rome).

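For concreteness, a GOFAI-style fact like in(Eiffel-Tower, Rome) might be stored and edited roughly as follows. This is a minimal sketch, not any historical system: the tuple encoding, the helper names `located_in` and `edit_location`, and the extra Paris/Rome/Italy facts are all assumptions made for illustration.

```python
# Hypothetical GOFAI-style fact base: each fact is (relation, inner, outer).
# The entities beyond the Eiffel Tower are illustrative assumptions.
FACTS = {
    ("in", "Eiffel-Tower", "Paris"),
    ("in", "Paris", "France"),
    ("in", "Rome", "Italy"),
}


def located_in(facts, thing, place):
    """Return True if `thing` is in `place`, following chains of in(...) facts."""
    frontier, seen = [thing], set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for relation, inner, outer in facts:
            if relation == "in" and inner == current:
                if outer == place:
                    return True
                frontier.append(outer)
    return False


def edit_location(facts, thing, new_place):
    """Return a new fact set where every in(thing, _) is replaced by in(thing, new_place)."""
    kept = {f for f in facts if not (f[0] == "in" and f[1] == thing)}
    kept.add(("in", thing, new_place))
    return kept


edited = edit_location(FACTS, "Eiffel-Tower", "Rome")
print(located_in(FACTS, "Eiffel-Tower", "France"))   # before the edit
print(located_in(edited, "Eiffel-Tower", "Italy"))   # after the edit
```

The point of the sketch is only that both the stored fact and the edit are fully legible here, which is the kind of representation 2006 already had.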

This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, cognitive representation, within a text transformer / large language model (LLM), whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (to pick an arbitrary threshold for "before the rise of deep learning").


Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in e.g. video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
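
The coordinate idiom can be sketched as a point-in-polygon test. This is only an illustration under loud assumptions: the box standing in for France is deliberately crude and is not a real border, the function name `point_in_polygon` is hypothetical, and the coordinates are approximate (longitude, latitude) pairs.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count how many polygon edges a ray going right from `point` crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Crude bounding box used as a stand-in for France (assumption, not a real boundary).
FRANCE_BOX = [(-5.0, 42.0), (8.5, 42.0), (8.5, 51.5), (-5.0, 51.5)]

EIFFEL_TOWER = (2.2945, 48.8584)   # approximate (lon, lat)
ROME = (12.4964, 41.9028)          # approximate (lon, lat)

print(point_in_polygon(EIFFEL_TOWER, FRANCE_BOX))
print(point_in_polygon(ROME, FRANCE_BOX))
```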


2006 similarly knew about linear regression, k-nearest-neighbor, principal component analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).


GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by end of 2026 we will not be able to grasp any inkling of the cognition inside of LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside of LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome); on their way to completing larger and more impressive tasks, incorporating other cognitive steps; whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.

Thomas Kwa

What's an example of a cognitive capability unknown to cognitive science in 2006? MCTS? GAN?

David Dobrinskiy bought Ṁ100 of NO

Considering how long it took until Anthropic made progress on a transformer 1 layer deep, I am confidently betting NO.

Will be very happy to lose this bet though.

L is predicting YES at 37%

I would love to see a market about whether, in your view, we establish that there are no patterns that were not known to the sciences in 2006 hidden in a language model, at all, in any form. I suspect we will be able to decode language models into components that demonstrate that every module is made of something you already understood. The only thing that's unusual is that it's an approximation to solomonoff induction that you can actually run because of getting help from the gradient. The algorithms discovered seem incredibly likely to me to turn out to be fundamental ones we had already found.

Eliezer Yudkowsky is predicting NO at 39%

@L That seems very unlikely to me to be true at all, but unlikelier yet to become a known fact anytime soon even conditional on its being true!

Spindle

@EliezerYudkowsky rationalussy

L is predicting YES at 39%

@EliezerYudkowsky I would buy a lot of yes, but I'm crazy and I think we can formally verify automatic bounds on simplification of precisely bounded approximations of physical systems, and that we'll of course also be able to do it on neural networks as they're a much simpler physical system. if we can formally verify the trajectory basins of smoothhashlife, we can solve safety by constraining trajectories, and the question is then only what coprotection objective we actually need to formally verify. this is still a research prompt, but I think identifying basins in diffusion models is a viable trajectory towards a representation of discovering agency for physical systems, because the physical model we care about is a causal-information-bottlenecked system (causal physics) that we can safely assume to have invariant global rules. if we can make a jit for physics, then we should also be able to formally verify the jit for physics. which means we should be asking now what properties we want to verify about physical systems as our key question. and I think we will then find - there was nothing we did not know then that language models are strong enough to discover; what we didn't know was how to shave it down to run approximately, fast.

I know your fear is that then, we instantly get superagency; but I would argue that the key thing we need to do to prevent hazardous superplanners is to formalize the boundaries between physical systems, and what it means to traverse a boundary between systems. and I think semi-formal ways to define those boundaries, like the ones humans have been using for hundreds of years, will suffice for a few months to years while we figure out how to check constraints at scale.

any researcher on any topic I'm optimistic about would think I'm crazy. That's fair; I don't think I can bet that any one particular research plan will be the particular one to win the safety-mining lottery. but when you're in the safety mines, it's much less likely to be the one to figure it out than to be nearby when your friend across the room figures it out; so, much of the challenge is networking with enough people that the ones who already know most of what it takes can just talk to each other. safety is mostly a task of networking humans who have almost figured it out, and as a result, I think y'all over at lesswrong are actually about to figure it out, and you just are too chronically anxious to admit that current research directions are really promising actually.

(do hurry up, though. have you considered quitting miri and applying for deepmind?)

L is predicting YES at 37%

@L also, thing I didn't clarify - why buy yes when I think the true answer is probably "well we would have found them if they were there, but they weren't, and we proved it"? because I expect interpretability research to get far enough that this comes down to definition differences, and what becomes a known fact will be close enough to the edge that you're more likely to feel there was something new. Idk, I'm not the best bettor, as you can see from my profits graph. But this is my expectation as a hubris-based researcher with no big successes of my own.

Lawrence Chan sold Ṁ293 of NO

More heartening progress on LM interp from Redwood Research:

Note that they identified an algorithmic task, but they managed to reverse engineer most of the circuit that performs said task (a lot more than the ROME paper).

Martin Randall bought Ṁ10 of YES

Presumably mana is worth more if this resolves Yes since that means we are less likely to all be dead a few years later.

Andrew Hartman is predicting NO at 41%

I'm really surprised this market is resting as high as it is. If you read Eliezer's writeup (as well as some of his other related articles) it's clear this is a very high bar.

Victor Levoso is predicting YES at 33%

@AndrewHartman So on one hand it definitely is very hard but I think that Anthropic is making some progress towards this and actually going to try and put a lot of money and effort into this.
Other orgs like Conjecture are also working towards this and might find something.
2026 is relatively far away in terms of time for research to be done so there's definitely a lot of room to make advances on this if people actually try (assuming we aren't dead by then).

And I expect the amount of interpretability research to increase soon, especially if Anthropic becomes big and popular.
41% does feel too high probably but rn the market is back around 30%, which sounds more reasonable.
If the question was about being able to understand a SOTA model in detail while developing it, which unfortunately is likely the level of interpretability we need, then I would be much more pessimistic.

Andrew Hartman is predicting NO at 22%

@VictorLevoso Well, we definitely seem to be hitting that middle stretch of a new field, where the productivity surges, so I can see a really high level argument for YES, but it still seems to me like there's some hefty factors working against it for the resolution criteria. It's got to be a novel cognitive pattern - in a moderately short timeframe - and, most significantly in my opinion, we have to be able to actually peer into the internals of some black box well enough to understand this novel pattern and confirm its uniqueness, y'know?

Victor Levoso bought Ṁ150 of YES

@AndrewHartman Yeah, but after actually reading what has been done and thinking about it a lot more while writing my SERI MATS application it feels much more doable.

Especially if it turns out there's some pattern in GPT-2 or in a 1l transformer that fits the criterion.

Lawrence Chan is predicting NO at 31%

@VictorLevoso There aren't going to be complicated algorithms in a 1l (attention only) transformer; I think the Anthropic mathematical models paper has characterized them basically fully.

Finding a nontrivial nonalgorithmic circuit in GPT-2 would definitely count, I think. The main concern I have (and the reason I'm long NO) is that while people have been making progress on interpreting easy-to-specify behavior on ever larger models, we don't really have a good approach that deals with the sort of fuzzy "know it when I see it" type of behavior Eliezer wants explained. (IIRC Neel's SERI MATS task had people interp algorithmic behavior on GPT-2-small?) And the bar Eliezer is setting is quite high; I'm not sure he considers what I consider existing positive examples in vision (the high-low frequency neurons in the inception models, for example) as positive.

Lawrence Chan is predicting NO at 31%

Well, except for the embeddings, which is where a lot of the magic is happening even on algorithmic-ish tasks! But those might just not be easily interpretable, in the same way that a 2048-d "low" rank matrix approximation might not be easily interpretable.

L

how does this market resolve if we discover, to some reasonably strong-in-your-view degree, that there are only vanishingly small fragments of unknown semantically understandable algorithms, but that the vast majority are simply naive bayes of algorithms we already knew? because I very much consider it possible that we can reach strong ASI while our learning systems are still only generating subalgorithms that are already semantically known to us. neural networks are effectively high-d linear program forests, after all, so I don't think the thing you're looking for could be there even in the thing you think would be guaranteed to instantiate it, which presumably is some sort of direct solomonoff thingy. Can you construct a dataset that is guaranteed to require such a pattern for the dataset to be accurately represented using a machine learning or GOFAI system, but which does not generate such a pattern in the behavior of the representation algorithm when algorithms from 2006 are used?

I ask because it seems to me that we really did have all the components in 2006 and that modern work has been a task of figuring out how to put them together scalably.

Ryan Moulton

Can you give a concrete example of a pattern, that if discovered, would make you resolve it to "yes?"

Spindle

when will we have an AI zmart enuff to realize that THIS MARKET IZ RIGGED?

Martin Randall bought Ṁ0 of NO

I don't think GOFAI typically did things like having in(Eiffel, Paris) but also in(Eiffel, Vegas) and having both of these "facts" being held in tension and contextually activated depending on context.

(I'm not sure what the terminology for this feature is, sorry)

This cognitive concept allows new LLM abilities like being able to answer "after I got married by Elvis I did some gambling and visited some casinos, the Pyramids and the Eiffel Tower. What country am I now in?".

If this doesn't make the market resolve yes, I'd like to understand why so I can bet more accurately on it.

Eliezer Yudkowsky is predicting NO at 44%

@MartinRandall This seems like it fits straightforwardly into a Naive Bayes sort of framework, and I'd expect you could turn up some GOFAI programs that could resolve this sort of internal tension. But even if that weren't true, the reason why an LLM maybe answering this question correctly has little to do with this market is that we cannot look inside the internals of the LLM and read out of the neurons how it does that and find some algorithm unknown to Naive Bayes or other artifice of 2006. This is not a market about what LLMs can do; it's a market about what we can see in the giant inscrutable matrices of floating-point numbers to understand semantically the internals of how they do it.
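
A minimal sketch of how a Naive Bayes framing could hold "Eiffel Tower in Paris" and "Eiffel Tower in Las Vegas" in tension and let context decide. Every probability below is made up for illustration, and the names `PRIOR`, `LIKELIHOOD`, and `posterior` are hypothetical; it is not a claim about how any LLM or any particular GOFAI program does it.

```python
import math

# Made-up prior over where the speaker is (assumption).
PRIOR = {"Paris": 0.5, "Las Vegas": 0.5}

# Made-up P(clue is mentioned | city); clues are treated as independent given the city.
LIKELIHOOD = {
    "Paris": {"eiffel tower": 0.9, "casinos": 0.05, "elvis wedding": 0.01, "pyramids": 0.02},
    "Las Vegas": {"eiffel tower": 0.3, "casinos": 0.9, "elvis wedding": 0.5, "pyramids": 0.4},
}


def posterior(clues):
    """Return P(city | clues) under the Naive Bayes independence assumption."""
    log_scores = {
        city: math.log(PRIOR[city]) + sum(math.log(LIKELIHOOD[city][c]) for c in clues)
        for city in PRIOR
    }
    # Subtract the max log score before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {city: math.exp(s - top) for city, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {city: v / total for city, v in unnormalized.items()}


print(posterior(["eiffel tower"]))
print(posterior(["elvis wedding", "casinos", "pyramids", "eiffel tower"]))
```

With only the Eiffel Tower mentioned, the sketch favors Paris; adding the Elvis wedding, casinos, and Pyramids shifts nearly all the mass to Las Vegas.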

Martin Randall

@EliezerYudkowsky On the first point, it seems that Bayesian inference came after the symbolic logic of GOFAI approaches. Here is a thread from 10 years ago which contrasts logicist GOFAI with "new era" Bayesian methods.

Further suggestive evidence from Wikipedia - this describes uncertainty being handled with Bayesian reasoning after the second AI winter, ending in 2011.

I think your explanation here is pointing out that even if Bayesian probability was not used in AI in 2006, even by a single overexcited grad student, Bayesian probability was from the 18th century, so it is not "unknown to cognitive science".

On the other hand if we found that a neural network was using an approximation to Bayes that is more efficient or more easily learnable or more accurate on the training data, and we understood that algorithmic approximation as well as we understand Bayes, then that would resolve YES.

Eliezer Yudkowsky is predicting NO at 53%

For some reason I was not able on mobile to reply to this Q, and both of my attempted answers ended up attached to a different Q instead. My intended answer is here:

(Roughly, Bayesian methods are much much older in AI than 2006.)

Martin Randall is predicting YES at 44%

Here is an algorithm for doing modular addition via DFTs that I suspect was not familiar to AI and cognitive science in 2006, because it's weird.

This doesn't resolve this market YES because it was not discovered in an LLM, and also because modular addition is not a "cognitive capability original to the era of deep learning".

Neel Nanda

@MartinRandall Eh, I don't think my work qualifies even in spirit - it surprised me that a model uses DFTs, but this is mostly because DFTs are a surprisingly natural operation if you're a linear algebra machine with a thin veneer of softmaxes and ReLUs. And DFTs were definitely familiar in 2006.
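
The general idea of doing modular addition with trigonometric (DFT-style) features can be sketched as follows. This is an illustration of the technique, not the trained network's actual weights: the modulus `p = 113`, the frequency set `(1, 2, 5)`, and the function name `mod_add_via_trig` are all assumptions chosen for the example.

```python
import math


def mod_add_via_trig(a, b, p=113, freqs=(1, 2, 5)):
    """Compute (a + b) mod p using only sines, cosines, products, and an argmax."""
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
        # from per-input features via products alone.
        cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
        sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
        for c in range(p):
            # cos(w(a+b-c)) equals 1 exactly when c == (a + b) mod p, and is smaller otherwise.
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    # The candidate where every frequency interferes constructively wins.
    return max(range(p), key=logits.__getitem__)


print(mod_add_via_trig(100, 50))
```

Every step is a product or a sum of fixed trigonometric features, which is why this looks natural for a machine built mostly out of linear algebra.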

Martin Randall bought Ṁ10 of YES

I'm wondering about the dog recognition neural net algorithm which recognizes a left facing dog and a right facing dog and XORs them together. Would that count?

The component parts are obviously not new, but I'm not aware of them being put together in that combination for dog recognition prior to 2006. That said I literally fell asleep during my graphical algorithms lectures so my lack of knowledge doesn't mean much.

Martin Randall bought Ṁ100 of YES

(It wouldn't count because it's not in an LLM)

Martin Randall sold Ṁ122 of YES

Source for left vs right facing dog algorithm - Chris Olah -

There’s actually this really beautiful circuit for detecting dog heads.

And I know it sounds crazy to describe that as beautiful, but I’ll describe the algorithm because I think it’s actually really elegant. So in InceptionV1, there’s actually two different pathways for detecting dog heads that are facing to the left and dog heads that are facing to the right. And then, along the way, they mutually inhibit each other. So at every step it goes and builds the better dog head facing each direction and has it so that the opposite one inhibits it. So it’s sort of saying, “A dog’s head can only be facing left or right.”

And then finally at the end, it goes and unions them together to create a dog head detector, that is pose-invariant, that is willing to fire both for a dog head facing left and a dog head facing right.

Not an LLM, but I think this is a "cognitive capability original to the era of deep learning". So if we get the same level of understanding of LLMs this market would resolve YES.

Eliezer Yudkowsky is predicting NO at 29%

@MartinRandall Bayes was around in AI long long long before 2006. "Let a Single Flower Bloom" was in 1988.

A particular LLM algorithm that does something clever we didn't know how to do in 2006, into which we have semantic insight of the clever part and that doesn't much look like any representation from 2006, resolves YES. A Bayesian view of that is just a kind of understanding. I'm not going to say "it's just Bayes" because every cognitive process that does useful work contains some Bayesian grain of validity that explains why it works. But that's a separate topic.

If you think Bayesian methods are only 10 years old in AI, consider staying out of this betting? You've insufficient domain knowledge of what was already known in 2006.

Martin Randall bought Ṁ50 of NO

@EliezerYudkowsky I always consider staying out of the betting. In this case I'm happy to subsidize this market with my ignorance.

Fortunately as I read your resolution criteria, it actually makes zero difference that GOFAI used Bayes or anything else, because the actual criteria are whether a "cognitive pattern" was "unknown to cognitive science in 2006".

Currently I am wondering how many cognitive patterns were discovered between end of 2006 and 2022. I'm thinking not many.

Also wondering whether anything done by LLMs proves its use of genuinely new cognitive patterns. I think probably not.

So even before we get to whether interpretability takes off, there are some big NO factors weighing down this market.

That is, if I understand the resolution criteria correctly, which I'm still unsure about.

Bolton Bailey bought Ṁ0 of YES

A few thoughts:

  1. ML research existed a while prior to 2006, presumably the field managed to identify a large number of the learning patterns that are moderately easy for human ML researchers to understand.

  2. Presumably we are ruling out Deep Neural Networks themselves as an "internal structure" to LLMs, in the sense that there might be some neuron that identifies some subproblem, and if it fires then the output of the model is selected to come from some other set of neurons, which gradient descent then trains as if they were a subnetwork.

  3. Some ML theorists like to approximate Neural Networks themselves as kernel machines. This model isn't accurate for real-world networks I think, but it indicates to me that there is latitude for describing NN phenomena in terms of earlier paradigms.

  4. On the other hand, a researcher could just take a very careful approach of trying to explain the behavior of an LLM on a particular task by building up an understanding of relevant neurons/groups of neurons/larger substructures. I'm not sure if the outcome of that would be post-2006, but it seems plausible.

Eliezer Yudkowsky is predicting NO at 53%

2 - Identifying a subnetwork of the larger net, that solves some interesting problem that 2006 couldn't solve, is not the same as getting semantic transparency into how the subnetwork solves the interesting problem.

4 - Color me skeptical on this working in real life: but if you traced out the exact behavior of a subnetwork, and came away with a semantic understanding of some computational pattern and representation unknown to 2006, that was a key thing 2006 didn't know about how to solve some problem not solvable in 2006, that's just one particular way of winning.

Martin Randall bought Ṁ10 of YES

Does "we" and "anyone" and the like in this market include cases where "we" have discovered something mainly by pointing an opaque program at it?

Eliezer Yudkowsky is predicting NO at 34%

@MartinRandall What matters isn't the opacity of the discovering program, but whether the discovered result is semantically transparent to us.

Mr R

1) Supposing these models can be cashed out into some hierarchical mixture of well understood models, with the mixture itself being explicable in terms of a combination of techniques that pre-2007 researchers knew about but didn't put much stock in, would the question resolve no?

I mean techniques as simple as using relus, clipping gradients and so on, but ones that people did know about back then but which never became popular for one reason or another.

Neel Nanda is predicting YES at 53%

@MrR I expect it depends heavily on HOW complex a hierarchical stack? Like, ultimately everything GPT-3 does is a matrix multiplication (or a gelu or a softmax), which are all fully understood. But it still does crazy shit. I think "and we actually meaningfully understand what it's doing and how, and could not reasonably have done this using pre-2006 techniques" feels like the spirit of this.

Austin

Featuring this market because:

  1. This is an interesting and important question imo

  2. The question description is very well written, and I'd like to see more of these kinds of writeups on Manifold

  3. A lot of users are familiar with Eliezer's work, so they should get a chance to view and comment on this market