Do GPT prompts sharing their first 90% of text share most of their computation?


Resolved YES on Jan 21.

Resolves YES if Nemehodva or Eliezer (who have so far doubted this on Discord) admit this. The first of them to bet at least 1k on NO against me gets to require that they in particular admit this.

Resolves NO if I admit this is false.

Resolves N/A if neither happens within a week, which would be sad.

Close date updated to 2023-01-24 12:59 am [<- dunno where "closes in an hour" came from.]


predictedNO

Ok, I think I’m convinced, and my main mistake was forgetting that GPT uses causal attention.

Transformers are N^2 * D in attention, and N * D^2 in feed-forward.

GPT-3 at ~10k dims is dominated by the feed-forward layer (for prompts <<10k tokens), and thus linear in token count.

Generative transformers are also (usually) run with triangular, forward-only attention, even for the prompt, so the activations for the entire first 90% can be reused.

In short, causal attention (prior tokens attended to, never future ones) and linear scaling mean X% shared = X% compute shared.
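A minimal sketch of that claim, assuming a toy single-head attention stack with random weights (no layer norm, and no feed-forward block, since that acts on each position independently and cannot mix tokens). All names here are hypothetical. It runs two prompts that share their first 9 of 10 token vectors and checks whether the prefix activations agree after several layers, with and without the causal mask.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def attention_layer(x, wq, wk, wv, causal):
    """One residual single-head self-attention layer over a list of token vectors."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(x[0])
    out = []
    for i, qi in enumerate(q):
        # Causal: position i sees positions 0..i only. Otherwise it sees everything.
        visible = range(i + 1) if causal else range(len(x))
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[j][c] for w, j in zip(weights, visible)) / total for c in range(d)]
        out.append([xi + m for xi, m in zip(x[i], mixed)])
    return out

def forward(x, layers, causal=True):
    """Hidden states after the last layer, one vector per position."""
    for wq, wk, wv in layers:
        x = attention_layer(x, wq, wk, wv, causal)
    return x

rng = random.Random(0)
d, n_tokens, n_shared, n_layers = 8, 10, 9, 3
layers = [tuple(rand_matrix(d, d, rng, d ** -0.5) for _ in range(3)) for _ in range(n_layers)]
prefix = rand_matrix(n_shared, d, rng)
prompt_a = prefix + rand_matrix(n_tokens - n_shared, d, rng)
prompt_b = prefix + rand_matrix(n_tokens - n_shared, d, rng)

causal_a, causal_b = forward(prompt_a, layers), forward(prompt_b, layers)
full_a = forward(prompt_a, layers, causal=False)
full_b = forward(prompt_b, layers, causal=False)

prefix_shared_causal = causal_a[:n_shared] == causal_b[:n_shared]
prefix_shared_full = full_a[:n_shared] == full_b[:n_shared]
print("causal prefix identical:", prefix_shared_causal)
print("bidirectional prefix identical:", prefix_shared_full)
```

In this toy setup the prefix activations agree exactly under the causal mask and diverge once every position can attend to every other, which is the distinction the argument rests on.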

predictedYES

Note that the identical prefix activations are only due to causal masking; this won't actually cause the invocations to share compute unless the hidden states are cached, which as far as I know is usually not done, for space-efficiency reasons (e.g. I don't think OpenAI caches). But if you know that you're going to be continuing the same sequence, you can always keep the activations in memory and keep reinvoking with the extended sequence.
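A rough sketch of what "keep the activations in memory" could look like, assuming a toy single-head causal attention stack (no feed-forward, no layer norm). `extend`, `empty_cache`, and `continue_from` are hypothetical names, not any library's API. The shared prefix is processed once, its per-layer keys and values are kept, and each suffix is then run against a copy of that cache.

```python
import math
import random

def vecmat(x, w):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(a * b for a, b in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Softmax attention of one query over the keys/values visible to it."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(d)]

def empty_cache(layers):
    """One (keys, values) pair of lists per layer."""
    return [([], []) for _ in layers]

def extend(tokens, layers, cache):
    """Run new tokens through every layer, appending their keys/values to cache.

    Returns the final-layer hidden state of each new token.
    """
    outputs = []
    for x in tokens:
        for (wq, wk, wv), (keys, values) in zip(layers, cache):
            keys.append(vecmat(x, wk))
            values.append(vecmat(x, wv))
            mixed = attend(vecmat(x, wq), keys, values)  # sees earlier positions + itself
            x = [a + b for a, b in zip(x, mixed)]
        outputs.append(x)
    return outputs

def continue_from(cache, suffix, layers):
    """Process a suffix on top of a cached prefix without mutating the cache."""
    branch = [(list(keys), list(values)) for keys, values in cache]
    return extend(suffix, layers, branch)

def rand_matrix(rows, cols, rng, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, n_shared, n_new, n_layers = 8, 9, 1, 3
layers = [tuple(rand_matrix(d, d, rng, d ** -0.5) for _ in range(3)) for _ in range(n_layers)]
prefix = rand_matrix(n_shared, d, rng)
suffix_a = rand_matrix(n_new, d, rng)
suffix_b = rand_matrix(n_new, d, rng)

# Pay for the shared prefix once.
prefix_cache = empty_cache(layers)
extend(prefix, layers, prefix_cache)

cached_a = continue_from(prefix_cache, suffix_a, layers)
cached_b = continue_from(prefix_cache, suffix_b, layers)

# Reference: recompute each full prompt from scratch and keep the suffix outputs.
scratch_a = extend(prefix + suffix_a, layers, empty_cache(layers))[n_shared:]
scratch_b = extend(prefix + suffix_b, layers, empty_cache(layers))[n_shared:]
print(cached_a == scratch_a, cached_b == scratch_b)
```

The space cost mentioned above is visible here: the cache holds one key and one value vector per layer per prefix position, and it has to stay resident for as long as further continuations are expected.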

also, this is all mostly irrelevant, because https://github.com/BlinkDL/RWKV-LM style models are the future anyway. Transformers are a temporary research artifact.

predictedYES

https://www.backprop.org/transformers

predictedYES

okay here we go, this one is better. scroll to "text generation transformer" https://peterbloem.nl/blog/transformers

predictedYES

https://e2eml.school/transformers.html#masking

predictedYES

@L https://aman.ai/primers/ai/transformers/#masking-features

predictedYES

@L ...whoops, posted google results too quick, these are the wrong step of the process. ignore these two links.

https://medium.com/@jinoo/a-simple-example-of-attention-masking-in-transformer-decoder-a6c66757bc7d

My Discord statement:

If two texts of length 1000 share a prefix of length 900, you get to reuse 81% of the computation; therefore if you use most of an enlarged context window for an immutable plot summary, the blowup is effectively linear instead of quadratic.

Nemehodva's Discord statement:

Hm, I'm not sure that's true? Maybe for the first layer, but then you get non-local dependencies

Eliezer's Discord statement:

You only get to reuse computation in the first layer. After that everything depends on everything.

predictedYES

Upon request, what my Discord statement means:

If you call GPT with two prompts of length 1000, and their first 900 tokens are the same, then >80% of the numbers GPT internally calculates in response to each prompt will be the same. GPT runs in O(prompt_length^2) on the first prompt, and could be run in O(prompt_length*length_of_new_suffix) on the second one.
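The arithmetic behind those figures can be checked in a few lines. This is only a counting sketch under the assumption that attention cost is proportional to the number of (query, key) pairs a causal mask allows, and that everything else is proportional to token count; the function names are made up for illustration.

```python
def causal_pairs(n):
    """Number of (query, key) pairs with key position <= query position."""
    return n * (n + 1) // 2

def shared_fraction(total, shared):
    """Fraction of each kind of work that the shared prefix accounts for."""
    return {
        "attention_pairs": causal_pairs(shared) / causal_pairs(total),
        "per_token_work": shared / total,
    }

def suffix_attention_pairs(total, shared):
    """Pairs still to compute for the second prompt once the prefix is cached."""
    return causal_pairs(total) - causal_pairs(shared)

frac = shared_fraction(1000, 900)
remaining = suffix_attention_pairs(1000, 900)
print(frac)
print(remaining, "pairs left, bounded by", 1000 * 100)
```

For 1000 tokens with 900 shared, the quadratic part comes out at roughly 900²/1000² = 81% shared and the linear part at 90%, so ">80%" holds either way. The leftover attention work is at most prompt_length × length_of_new_suffix pairs.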

predictedYES

@Gurkenglas Eliezer incorrectly thinks that GPT is an encoder-only architecture. The key difference between encoder and decoder is causal masking.
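One way to see why the mask matters for "after that everything depends on everything" is to track only which inputs each position can depend on, ignoring the actual numbers. The sketch below is a toy model of information flow, with hypothetical helper names: it composes the attention mask with itself once per layer and asks whether any position ever ends up depending on a later one.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(n)] for i in range(n)]

def dependencies_after(mask, n_layers):
    """deps[i][j] is True when the output at i after n_layers can depend on input j."""
    n = len(mask)
    deps = [[i == j for j in range(n)] for i in range(n)]  # layer 0: each input is itself
    for _ in range(n_layers):
        deps = [
            [any(mask[i][k] and deps[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return deps

def sees_future(deps):
    """True if any position depends on an input at a later position."""
    n = len(deps)
    return any(deps[i][j] for i in range(n) for j in range(i + 1, n))

decoder = dependencies_after(attention_mask(6, causal=True), n_layers=12)
encoder = dependencies_after(attention_mask(6, causal=False), n_layers=1)
print("decoder sees future:", sees_future(decoder))
print("encoder sees future:", sees_future(encoder))
```

A lower-triangular mask composed with itself stays lower-triangular, so stacking layers never lets a prefix position pick up a dependency on the suffix. With the unmasked encoder-style pattern, a single layer is already enough for every position to depend on every input.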
