MANIFOLD
Do GPT prompts sharing their first 90% of text share most of their computation?
Resolved YES on Jan 21

Resolves YES if Nemehodva or Eliezer (who have so far doubted this on Discord) admit this. The first of them to bet at least 1k on NO against me gets to require that they in particular admit this.

Resolves NO if I admit this is false.

Resolves N/A if neither happens within a week, which would be sad.

Close date updated to 2023-01-24 12:59 am [<- dunno where "closes in an hour" came from.]


predictedNO

Ok, I think I’m convinced, and my main mistake was forgetting that GPT uses causal attention.

Transformers cost N^2 * D in attention and N * D^2 in feed-forward, for N tokens and D hidden dimensions.

GPT-3 at ~10k dims is dominated by the feed-forward layer (for prompts <<10k tokens), and thus linear in token count.
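A rough way to check this scaling claim is to count multiply-accumulates per layer. The sketch below is only an estimate under stated assumptions: a single standard layer, a feed-forward hidden width of 4 * d_model, and d_model = 12288 for the largest GPT-3. The function names are made up for illustration.

```python
def layer_macs(n_tokens: int, d_model: int, ffn_mult: int = 4) -> dict:
    """Approximate multiply-accumulates for one transformer layer (assumed shapes)."""
    # Quadratic in tokens: QK^T scores plus the attention-weighted sum of values.
    attention_mixing = 2 * n_tokens**2 * d_model
    # Linear in tokens: Q, K, V and output projections (4 * d^2 per token).
    projections = 4 * n_tokens * d_model**2
    # Linear in tokens: two feed-forward matrices (2 * ffn_mult * d^2 per token).
    feed_forward = 2 * ffn_mult * n_tokens * d_model**2
    return {"quadratic": attention_mixing, "linear": projections + feed_forward}


def quadratic_share(n_tokens: int, d_model: int) -> float:
    """Fraction of the layer's work that grows quadratically with token count."""
    m = layer_macs(n_tokens, d_model)
    return m["quadratic"] / (m["quadratic"] + m["linear"])


share_1k = quadratic_share(1000, 12288)
```

Under these counting assumptions, `share_1k` comes out at about 1.3%, so a 1000-token prompt is almost entirely linear-cost work, and the two parts only become equal at n_tokens = 6 * d_model.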

Generative transformers are also (usually) run with triangular, forward-only attention, even over the prompt, so the computation for the entire first 90% can be reused.

In short, causal attention (prior tokens attended to, never future ones) and linear scaling mean X% shared = X% compute shared.
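To see why causal masking is what makes the prefix reusable, here is a toy check in NumPy. It is a sketch, not GPT: single-head attention, no layer norm, no positional encoding, random weights, and random vectors standing in for token embeddings. All names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def make_layers(n_layers, d, seed=0):
    rng = np.random.default_rng(seed)
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d),
              "w1": (d, 4 * d), "w2": (4 * d, d)}
    return [{k: rng.normal(scale=d**-0.5, size=s) for k, s in shapes.items()}
            for _ in range(n_layers)]


def forward(x, layers, causal=True):
    n, d = x.shape
    for p in layers:
        q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
        scores = q @ k.T / np.sqrt(d)
        if causal:
            # Position i may only attend to positions j <= i.
            mask = np.tril(np.ones((n, n), dtype=bool))
            scores = np.where(mask, scores, -np.inf)
        x = x + softmax(scores) @ v                      # attention + residual
        x = x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]   # feed-forward + residual
    return x


rng = np.random.default_rng(1)
d, n, shared = 16, 10, 9
a = rng.normal(size=(n, d))
b = a.copy()
b[shared:] = rng.normal(size=(n - shared, d))   # only the last token differs
layers = make_layers(3, d)

out_a, out_b = forward(a, layers), forward(b, layers)
prefix_matches_causal = np.allclose(out_a[:shared], out_b[:shared])
prefix_matches_full = np.allclose(forward(a, layers, causal=False)[:shared],
                                  forward(b, layers, causal=False)[:shared])
```

With the mask, the outputs at the nine shared positions agree across all three layers. Without it, the differing last token leaks into every position from the first layer on, which is the "everything depends on everything" picture.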

predictedYES

Note that the identical output is only due to causal masking; it won't actually cause the invocations to share compute unless the hidden states are cached, which is usually not done, for space efficiency reasons, as far as I know; e.g. I don't think OpenAI caches. But if you know that you're going to be continuing the same sequence, you can always keep the activations in memory and keep reinvoking with the extended sequence.
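Keeping the activations in memory could look roughly like the sketch below: cache each layer's keys and values for the prefix once, then run only the new suffix tokens against that cache. Same caveats as any toy: single-head, no layer norm, no positional encoding, random weights, and hypothetical names throughout. It does not describe how any particular API is implemented.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def make_layers(n_layers, d, seed=0):
    rng = np.random.default_rng(seed)
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d),
              "w1": (d, 4 * d), "w2": (4 * d, d)}
    return [{k: rng.normal(scale=d**-0.5, size=s) for k, s in shapes.items()}
            for _ in range(n_layers)]


def forward(x_new, layers, cache=None):
    """Run only the new tokens, reusing cached per-layer keys/values of earlier ones."""
    d = x_new.shape[1]
    if cache is None:
        cache = [(np.zeros((0, d)), np.zeros((0, d))) for _ in layers]
    new_cache, x = [], x_new
    for p, (k_old, v_old) in zip(layers, cache):
        k = np.vstack([k_old, x @ p["wk"]])
        v = np.vstack([v_old, x @ p["wv"]])
        n_old, n_new = k_old.shape[0], x.shape[0]
        scores = (x @ p["wq"]) @ k.T / np.sqrt(d)   # shape (n_new, n_old + n_new)
        # New token i sits at absolute position n_old + i and sees positions <= that.
        allowed = np.arange(n_old + n_new)[None, :] <= (n_old + np.arange(n_new))[:, None]
        x = x + softmax(np.where(allowed, scores, -np.inf)) @ v
        x = x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]
        new_cache.append((k, v))
    return x, new_cache


rng = np.random.default_rng(1)
d = 16
layers = make_layers(3, d)
prefix = rng.normal(size=(9, d))
suffix_a, suffix_b = rng.normal(size=(1, d)), rng.normal(size=(1, d))

_, prefix_cache = forward(prefix, layers)             # prefix cost paid once
out_a, _ = forward(suffix_a, layers, prefix_cache)    # only the suffix is computed
out_b, _ = forward(suffix_b, layers, prefix_cache)

full_a, _ = forward(np.vstack([prefix, suffix_a]), layers)
full_b, _ = forward(np.vstack([prefix, suffix_b]), layers)
cached_matches_full = (np.allclose(out_a, full_a[9:])
                       and np.allclose(out_b, full_b[9:]))
```

The space cost mentioned above is visible here: the cache holds two arrays of shape (prefix_length, d) per layer, and it has to stay in memory between calls.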

also, this is all mostly irrelevant, because https://github.com/BlinkDL/RWKV-LM style models are the future anyway. Transformers are a temporary research artifact.

predictedYES

okay here we go, this one is better. scroll to "text generation transformer" https://peterbloem.nl/blog/transformers

predictedYES
predictedYES

@L ...whoops, posted google results too quick, these are the wrong step of the process. ignore these two links.

My Discord statement:

If two texts of length 1000 share a prefix of length 900, you get to reuse 81% of the computation; therefore if you use most of an enlarged context window for an immutable plot summary, the blowup is effectively linear instead of quadratic.

Nemehodva's Discord statement:

Hm, I'm not sure that's true? Maybe for the first layer, but then you get non-local dependencies

Eliezer's Discord statement:

You only get to reuse computation in the first layer. After that everything depends on everything.

predictedYES

Upon request, what my Discord statement means:

If you call GPT with two prompts of length 1000, and their first 900 tokens are the same, then >80% of the numbers GPT internally calculates in response to each prompt will be the same. GPT runs in O(prompt_length^2) on the first prompt, and could be run in O(prompt_length*length_of_new_suffix) on the second one.
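The numbers in that statement can be checked with a few lines of arithmetic. This is just a count of causal (query, key) pairs for the 1000/900 example; the function name is made up for illustration.

```python
def causal_pairs(n: int) -> int:
    """Number of (query, key) pairs where the key is at or before the query."""
    return n * (n + 1) // 2


n_total, n_shared = 1000, 900
suffix = n_total - n_shared

# Per-token quantities (embeddings, feed-forward activations): 90% shared.
per_token_shared = n_shared / n_total

# Pairwise attention quantities: pairs lying entirely inside the shared prefix.
pairwise_shared = causal_pairs(n_shared) / causal_pairs(n_total)

# Pairs still to compute for the second prompt once the prefix is reused.
remaining_pairs = causal_pairs(n_total) - causal_pairs(n_shared)
```

Here `pairwise_shared` is 405450 / 500500, just over 0.81, and `remaining_pairs` is 95050, which is below prompt_length * length_of_new_suffix = 100000.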

predictedYES

@Gurkenglas Eliezer incorrectly thinks that GPT is an encoder-only architecture. The key difference between encoder and decoder is causal masking.