By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?
Closes 2027 · 54% chance

In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" had been stored within GPT-J-6B, in the sense that you could poke the GPT there and make it believe the Eiffel Tower was in Rome.

"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can, for example, answer how to get to the Eiffel Tower from Berlin, even while believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions - we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented as in(Eiffel-Tower, Rome).
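For concreteness, here is a minimal sketch (all names and facts illustrative, not from any actual GOFAI system) of the kind of symbolic fact store a 2006-era system might use, including the trivial analogue of a ROME-style edit:

```python
# A GOFAI-style fact store: the "brute idea" in(Eiffel-Tower, France)
# represented as an explicit symbolic triple.
facts = {("in", "eiffel_tower", "france"), ("in", "brandenburg_gate", "germany")}

def located_in(entity):
    """Look up which region an entity is asserted to be in."""
    for rel, subj, obj in facts:
        if rel == "in" and subj == entity:
            return obj
    return None

def edit_fact(entity, new_region):
    """The ROME-style edit, done trivially here: overwrite the stored triple."""
    global facts
    facts = {f for f in facts if not (f[0] == "in" and f[1] == entity)}
    facts.add(("in", entity, new_region))

assert located_in("eiffel_tower") == "france"
edit_fact("eiffel_tower", "rome")
assert located_in("eiffel_tower") == "rome"
```

The point of the contrast: here the representation's semantics are transparent by construction, whereas in GPT-J the analogous edit works without us knowing the format of what was edited.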

 

This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, or cognitive representation within a text transformer / large language model (LLM) whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (an arbitrary threshold for "before the rise of deep learning").

 

Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in eg video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
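To make that 2006-era idiom concrete, here is a sketch of the spatial-coordinate representation described above (coordinates and the bounding box are approximate and purely illustrative):

```python
# Video-game-style spatial representation: entities get coordinates,
# regions get boundaries, and "X is in Y" reduces to a containment test.
landmarks = {"eiffel_tower": (48.858, 2.294)}  # (lat, lon), approximate
regions = {"france": (41.0, 51.5, -5.5, 9.8)}  # (lat_min, lat_max, lon_min, lon_max) bounding box

def is_in(landmark, region):
    lat, lon = landmarks[landmark]
    lat_min, lat_max, lon_min, lon_max = regions[region]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

assert is_in("eiffel_tower", "france")
```

Finding an emergent version of *this* inside a transformer would be a triumph of interpretability, but the representational idiom itself long predates deep learning.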

 

2006 similarly knew about linear regression, k-nearest-neighbor, principal components analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).
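As a reminder of how simple these 2006-era methods are, here is a bare k-nearest-neighbor classifier on toy 2-D data (the data and labels are illustrative only):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label). Returns the majority label of the
    k training points nearest to query (squared Euclidean distance)."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((1, 0), "a")]
assert knn_predict(train, (0.5, 0.5)) == "a"
assert knn_predict(train, (5, 5.5)) == "b"
```

Per the description above, even a clean discovery of an emergent circuit computing something like this inside an LLM would not resolve the market YES.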

 

GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by the end of 2026 we will not be able to grasp any inkling of the cognition inside LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome), on their way to completing larger and more impressive tasks incorporating other cognitive steps - steps whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.

Yoav Tzfati bought Ṁ300 of YES

I think this is a massive update up - from what I understand they have identified certain activations that correspond to "truthfulness". Maybe this isn't full "transparency", but perhaps this is the start of a line of research that leads to a YES resolution? "Truthfulness" isn't something we could represent in 2006, right?

https://www.lesswrong.com/posts/kuQfnotjkQA4Kkfou/inference-time-intervention-eliciting-truthful-answers-from

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model - LessWrong
Excited to announce our new work: Inference-Time Intervention (ITI), a minimally-invasive control technique that significantly improves LLM truthfulness using little resources, benchmarked on the Tru…
Vadim is predicting YES at 53%

@YoavTzfati surely people were using truthfulness back then? like, as a simple boolean thing?

Yoav Tzfati is predicting YES at 53%

@wadimiusz I think that representing the truth of a fact with a boolean is distinct from deciding whether to tell the truth or lie. I think this paper deals with the latter

Vadim is predicting YES at 54%

@YoavTzfati like, this innovation is about "apparently LLMs can sort of distinguish truthful facts from other stuff, and we can push its internal button to produce the truth specifically", so this is cool in terms of controllability, but noticing that LLMs store this distinction is not a big update, right? it doesn't explain why LLMs are strong, it's not like "here's this nontrivial model that LLMs have under the hood, that we totally failed to invent ourselves, and that's what makes them so cool" (which I think is the intent of this market). cuz we already had the notion of truthful statements and they didn't make our models so strong before DL came along.

Yoav Tzfati is predicting YES at 53%

@wadimiusz Hmm... maybe I misinterpreted the market then? @EliezerYudkowsky do you think this paper is relevant (as in research that builds on this will lead to a YES resolution)?

NLeseul is predicting NO at 52%

@YoavTzfati Just based on the discussion here so far, this sounds like pretty much the same thing as the Eiffel Tower example. It suggests that truth as a property of information is represented somewhere in the network (and that you can manipulate the network to exploit that property), but doesn't say anything yet about how that representation happens.

If this is accurate, it sounds pretty promising for the possibility of alignment to me, but it isn't really providing transparency into the algorithms or data structures involved yet (let alone demonstrating their novelty).

Eliezer Yudkowsky is predicting NO at 54%

@YoavTzfati I agree with @NLeseul.

Alex Mizrahi bought Ṁ100 of YES

What about "In-context Learning and Induction Heads"? https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

I would argue that "in-context learning" and "induction heads" were not known in 2006.

Martin Randall bought Ṁ100 of NO

@AlexMizrahi This is a great question. But the market dates from September 2022, and that paper was published in March 2022. I would therefore be surprised if this counted.

I think it would help traders on this market if we knew why it doesn't count. Maybe there are several reasons.

Eliezer Yudkowsky is predicting NO at 50%

@MartinRandall It's not about knowing algorithms are inside LLMs that we didn't previously know to be inside LLMs; it's about understanding the semantics of a representation or algorithm inside the LLM, which wasn't previously known to AI from before the age of deep learning.

Alex Mizrahi is predicting YES at 52%

@EliezerYudkowsky That paper is specifically about "the semantics of a representation or algorithm inside LLM". Have you read it?

The paper might not be convincing enough to resolve the market, but is it in the right direction?

I.e. suppose Anthropic continues this research and produces a more powerful paper. What would that paper need to say to convince you?

This market's description gives strong "no true Scotsman" vibes, so some fictional positive examples would be helpful.

Eliezer Yudkowsky is predicting NO at 52%

@AlexMizrahi This market is not about identifying semantics inside an LLM. It is not about identifying semantics inside an LLM which we did not previously know to be inside LLMs. This market is about identifying semantics inside an LLM such that those semantics, once uncovered, teach us something about semantic representations in general which we did not know in 2006.

Alex Mizrahi is predicting YES at 53%

@EliezerYudkowsky Stefan Banach identified "analogy" as one of the most powerful reasoning tools: e.g. an "ultimate mathematician" would see "analogies between analogies", while an ordinary mathematician finds analogies between theorems.

The article mentioned above explicitly identified an induction head in a tiny model - that is not a new algorithm, as it's similar to what is used e.g. in compression algorithms.

But then they talk about a more general case of induction heads which exists in larger models -- and those can be described as the use of analogies. I.e. consider few-shot prompting: you give a model a context with examples of problems similar to what you want to solve, and it solves a new problem. How? By finding an analogy between the new problem and the example problems. It works even if there is only a distant analogy between problems, i.e. it can operate at rather deep levels of abstraction.

I.e. "in-context learning" is (largely) "use of analogies", and it's implemented by "induction heads".
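A hard-coded toy analogue of the literal [A][B] ... [A] -> [B] induction pattern may help here (the real induction heads implement a soft, learned version of this inside attention layers; this sketch is my own illustration, not code from the paper):

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent previous occurrence
    of the current token and copying whatever followed it last time."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards for a match
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no prior occurrence: this mechanism has nothing to say

assert induction_predict(["the", "cat", "sat", "the"]) == "cat"
```

The more general induction heads discussed in the paper are argued to do this kind of prefix matching over fuzzier, more abstract similarity rather than exact token identity.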

I'm fairly certain that previously there was no algorithm which made use of analogies - definitely nothing which worked at deeper levels of abstraction.

Wouldn't it be a MAJOR DISCOVERY if a tool which was previously understood only on an intuitive level by our top-grade reasoning experts were identified as a concrete algorithm working on semantic representations?

In the article, arguments 4-6 provide evidence of induction head mechanisms in bigger models. In argument 4 they found a mechanical analogy between a simple induction head they analyzed before and much more complex pattern-matching behavior. Again, I very much doubt that before 2006 we had an algorithm which automatically identifies patterns of the kind used in IQ tests, i.e. we did not have algorithms which demonstrate intelligence of this kind.

Authors still consider it a hypothesis, so more evidence is needed. But if more evidence is found, I think this would satisfy all your criteria.

Eliezer Yudkowsky is predicting NO at 54%

@AlexMizrahi Copycat and Metacat are some of the leading previous explicit, semantically understood algorithms for analogies and analogies between analogies.

Andreas Stuhlmüller is predicting YES at 46%

More work towards scalable interpretability: Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner-workings of our largest and most widely deployed language models.

Eliezer Yudkowsky is predicting NO at 47%

@stuhlmueller Note that this is exactly a case of them (purportedly - I have not yet skimmed the paper) being able to identify what the AI had learned via knowing a prior classical model of what was being learned.

Neel Nanda is predicting YES at 54%

@EliezerYudkowsky That's an accurate summary of the paper, IMO

Andreas Stuhlmüller is predicting YES at 45%

Recent progress on understanding neuron activations using language models: https://openai.com/research/language-models-can-explain-neurons-in-language-models

Maybe a similar approach could over time explain the behavior of circuits.

Lukas Day is predicting YES at 48%

Maybe not?

Between the announcement and the white paper, it seems that the technique described doesn't work that well. The majority of the explanations generated for GPT2-neurons scored poorly and performance degrades as the model grows larger.

The discussion section of the paper also describes limitations such as:

  • Neurons might represent many features (the current technique can and does generate explanations along the lines of "X and sometimes Y" but isn't really suited to that)

  • The features neurons might represent could correspond to alien concepts (statistical constructs useful for next-token prediction or natural abstractions humans haven't discovered yet)

  • The technique can only describe correlations between network input and neuron interpretation, not mechanisms.

For me, the paper gives the same vibes as the 2017 paper "Could a Neuroscientist Understand a Microprocessor?" They applied data analysis methods from neuroscience to see whether they could tell if a MOS 6507 (the microprocessor from the Atari 2600) was playing Donkey Kong, Space Invaders, or Pitfall, and failed to do so.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005268

Chris K bought Ṁ10 of NO

The premise sounds like "the kind of problem that you don't recognize even if it's staring you in the face". Hence by definition we either had some understanding of the semantics back in 2006, or it is so novel it's unimaginable. Finding patterns for unimaginable things is left as an exercise to the reader.

Thomas Kwa

What's an example of a cognitive capability unknown to cognitive science in 2006? MCTS? GAN?

David Dobrinskiy bought Ṁ100 of NO

Considering how long it took until Anthropic made progress on a transformer 1 layer deep, I am confidently betting NO.

Will be very happy to lose this bet though.

L is predicting YES at 37%

I would love to see a market about whether, in your view, we establish that there are no patterns hidden in a language model that were not known to the sciences in 2006, at all, in any form. I suspect we will be able to decode language models into components that demonstrate that every module is made of something you already understood. The only thing that's unusual is that it's an approximation to Solomonoff induction that you can actually run, because of getting help from the gradient. The algorithms discovered seem incredibly likely to me to turn out to be fundamental ones we had already found.

Eliezer Yudkowsky is predicting NO at 39%

@L That seems very unlikely to me to be true at all, but unlikelier yet to become a known fact anytime soon even conditional on its being true!

Spindle

@EliezerYudkowsky rationalussy

L is predicting YES at 39%

@EliezerYudkowsky I would buy a lot of yes, but I'm crazy and I think we can formally verify automatic bounds on simplification of precisely bounded approximations of physical systems, and that we'll of course also be able to do it on neural networks as they're a much simpler physical system. if we can formally verify the trajectory basins of smoothhashlife, we can solve safety by constraining trajectories, and the question is then only what coprotection objective we actually need to formally verify. this is still a research prompt, but I think identifying basins in diffusion models is a viable trajectory towards a representation of discovering agency for physical systems, because the physical model we care about is a causal-information-bottlenecked system (causal physics) that we can safely assume to have invariant global rules. if we can make a jit for physics, then we should also be able to formally verify the jit for physics. which means we should be asking now what properties we want to verify about physical systems as our key question. and I think we will then find - there was nothing we did not know then that language models are strong enough to discover; what we didn't know was how to shave it down to run approximately, fast.

I know your fear is that then, we instantly get superagency; but I would argue that the key thing we need to do to prevent hazardous superplanners is to formalize the boundaries between physical systems, and what it means to traverse a boundary between systems. and I think semi-formal ways to define those boundaries, like the ones humans have been using for hundreds of years, will suffice for a few months to years while we figure out how to check constraints at scale.

any researcher on any topic I'm optimistic about would think I'm crazy. That's fair; I don't think I can bet that any one particular research plan will be the particular one to win the safety-mining lottery. but when you're in the safety mines, it's much less likely to be the one to figure it out than to be nearby when your friend across the room figures it out; so, much of the challenge is networking with enough people that the ones who already know most of what it takes can just talk to each other. safety is mostly a task of networking humans who have almost figured it out, and as a result, I think y'all over at lesswrong are actually about to figure it out, and you just are too chronically anxious to admit that current research directions are really promising actually.

(do hurry up, though. have you considered quitting miri and applying for deepmind?)

L is predicting YES at 37%

@L also, thing I didn't clarify - why buy yes when I think the true answer is probably "well we would have found them if they were there, but they weren't, and we proved it"? because I expect interpretability research to get far enough that this comes down to definition differences, and what becomes a known fact will be close enough to the edge that you're more likely to feel there was something new. Idk, I'm not the best bettor, as you can see from my profits graph. But this is my expectation as a hubris-based researcher with no big successes of my own.

Lawrence Chan sold Ṁ293 of NO

More heartening progress on LM interp from Redwood Research: https://www.lesswrong.com/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object

Note that although the task they identified is algorithmic, they managed to reverse-engineer most of the circuit that performs said task (a lot more than the ROME paper did).

Martin Randall bought Ṁ10 of YES

Presumably mana is worth more if this resolves Yes since that means we are less likely to all be dead a few years later.

Andrew Hartman is predicting NO at 41%

I'm really surprised this market is resting as high as it is. If you read Eliezer's writeup (as well as some of his other related articles) it's clear this is a very high bar.

Victor Levoso is predicting YES at 33%

@AndrewHartman So on one hand it definitely is very hard, but I think that Anthropic is making some progress towards this and is actually going to put a lot of money and effort into it.
Other orgs like Conjecture https://www.lesswrong.com/posts/eDicGjD9yte6FLSie/interpreting-neural-networks-through-the-polytope-lens
are also working towards this and might find something.
2026 is relatively far away in terms of time for research to be done, so there's definitely a lot of room to make advances on this if people actually try (assuming we aren't dead by then).

And I expect the amount of interpretability research to increase soon, especially if Anthropic becomes big and popular.
41% does feel too high probably, but right now the market is back around 30%, which sounds more reasonable.
If the question was about being able to understand a SOTA model in detail while developing it, which unfortunately is likely the level of interpretability we need, then I would be much more pessimistic.

Andrew Hartman is predicting NO at 22%

@VictorLevoso Well, we definitely seem to be hitting that middle stretch of a new field, where the productivity surges, so I can see a really high level argument for YES, but it still seems to me like there's some hefty factors working against it for the resolution criteria. It's got to be a novel cognitive pattern - in a moderately short timeframe - and, most significantly in my opinion, we have to be able to actually peer into the internals of some black box well enough to understand this novel pattern and confirm its uniqueness, y'know?

Victor Levoso bought Ṁ150 of YES

@AndrewHartman Yeah, but after actually reading what has been done and thinking about it a lot more while writing my SERI MATS application, it feels much more doable.

Especially if it turns out there's some pattern in GPT-2 or in a 1L transformer that fits the criterion.

Lawrence Chan is predicting NO at 31%

@VictorLevoso There aren't going to be complicated algorithms in a 1L (attention-only) transformer; I think the Anthropic mathematical framework paper has characterized them basically fully.

Finding a nontrivial nonalgorithmic circuit in GPT-2 would definitely count, I think. The main concern I have (and the reason I'm long NO) is that while people have been making progress on interpreting easy-to-specify behavior on ever larger models, we don't really have a good approach that deals with the sort of fuzzy "know it when I see it" type of behavior Eliezer wants explained. (IIRC Neel's SERI MATS task had people interp algorithmic behavior on GPT-2-small?) And the bar Eliezer is setting is quite high; I'm not sure he considers what I consider existing positive examples in vision (the high-low frequency neurons in the inception models, for example) as positive.

Lawrence Chan is predicting NO at 31%

Well, except for the embeddings, which is where a lot of the magic is happening even on algorithmic-ish tasks! But those might just not be easily interpretable, in the same way that a 2048-d "low" rank matrix approximation might not be easily interpretable.

L

How does this market resolve if we discover, to some reasonably strong-in-your-view degree, that there are only vanishingly small fragments of unknown semantically understandable algorithms, and that the vast majority are simply naive Bayes combinations of algorithms we already knew? Because I very much consider it possible that we can reach strong ASI while our learning systems are still only generating subalgorithms that are already semantically known to us. Neural networks are effectively high-d linear program forests, after all, so I don't think the thing you're looking for could be there even in the thing you think would be guaranteed to instantiate it, which presumably is some sort of direct Solomonoff thingy. Can you construct a dataset that is guaranteed to require such a pattern for the dataset to be accurately represented using a machine learning or GOFAI system, but which does not generate such a pattern in the behavior of the representation algorithm when algorithms from 2006 are used?

I ask because it seems to me that we really did have all the components in 2006 and that modern work has been a task of figuring out how to put them together scalably.

Ryan Moulton

Can you give a concrete example of a pattern that, if discovered, would make you resolve this "yes"?

Spindle

when will we have an AI zmart enuff to realize that THIS MARKET IZ RIGGED?
