How will people run LLaMa 3 405B locally by 2025?
Gaming GPUs + heavy quantization (e.g. 6x4090 @ Q2_K)
Unified memory (e.g. Apple M4 Ultra)
Tensor GPUs + modest quantization (e.g. 4xA100 2U rackmount)
Distributed across clustered machines (e.g. Petals)
Server CPU (e.g. AMD EPYC with 512GB DDR5)

"Cloud" is a boring answer. User base of interest is somewhere between hobbyists with a budget and companies with a couple of self-hosted racks.
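For a rough sense of which of the options above can even hold the weights, here is a back-of-the-envelope sketch. It assumes nominal bit widths (2/4/8/16 bits per weight), ignores KV cache, activations, and quantization metadata, and the capacities are illustrative assumptions (24 GB per 4090, the 80 GB A100 variant, a 512 GB server), not measurements of real builds.

```python
# Rough weight-memory estimate for a 405B-parameter model.
# All figures are nominal: real quantization formats carry extra
# per-block metadata, and inference also needs KV cache and activations.

PARAMS = 405e9  # parameter count implied by "405B"


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def fits(bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weight_gb(bits_per_weight) <= capacity_gb


# Illustrative capacities (assumptions, not benchmarks of real rigs).
SETUPS_GB = {
    "6x RTX 4090 (24 GB each)": 6 * 24,
    "4x A100 (80 GB variant assumed)": 4 * 80,
    "Server CPU with 512 GB RAM (assumed)": 512,
}

if __name__ == "__main__":
    for name, capacity in SETUPS_GB.items():
        usable = [bits for bits in (2, 4, 8, 16) if fits(bits, capacity)]
        print(f"{name}: {capacity} GB, weights fit at {usable} bits/weight")
```

Under these assumptions, the gaming-GPU option only works at around 2 bits per weight, while the A100 and server-CPU options leave room for 4-bit and 8-bit weights respectively.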



My bet for running it locally is on Apple CPU/GPU (whatever it ends up being called).
And since that will still be expensive, everyone else will run it in a datacenter on server-class GPUs or inference chips (not sure what those will look like yet).

* Apple will find a way to compress/store weights in firmware such that you can work with, say, 64GB of RAM.

@VishalDoshi Have there been any "texture compression" decoders for LLM weights prototyped?

bought Ṁ10 Unified memory (e.g.... NO

How many answers are you picking? I’m sure someone somewhere will do each of these.

@MingweiSamuel Whatever looks Pareto-dominant based on vibes from Twitter, /g/, and r/LocalLLaMA. For example, the current 70B meta looks like multiple gaming GPUs or Apple unified memory, with very rare DIY-adaptered A100 frankenracks.

If the community doesn't settle on viable 405B solutions by EoY, everything gets a NO.