How will people run LLaMa 3 405B locally by 2025?

565Ṁ317

Jan 1

91%

Gaming GPUs + heavy quantization (e.g. 6x4090 @ Q2_0)

65%

Unified memory (e.g. Apple M4 Ultra)

60%

Tensor GPUs + modest quantization (e.g. 4xA100 2U rackmount)

60%

Distributed across clustered machines (e.g. Petals)

41%

Server CPU (e.g. AMD EPYC with 512GB DDR5)

"Cloud" is a boring answer. The user base of interest is somewhere between hobbyists with a budget and companies with a couple of self-hosted racks.
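
As a rough sanity check on the answer options, here is a minimal sketch of the weight-memory arithmetic. The bits-per-weight values and the 80 GB A100 variant are assumptions for illustration, not something the market specifies, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope weight memory for a 405B-parameter model.
# Assumptions: nominal bits per weight, decimal gigabytes, weights only
# (no KV cache, activations, or per-block quantization metadata).

PARAMS = 405e9  # parameter count implied by the model name


def weights_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the size of the weights alone, in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


def fits(total_memory_gb: float, bits_per_weight: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weights_gb(bits_per_weight) <= total_memory_gb


# Hypothetical setups loosely echoing the answer options.
setups = [
    ("6x RTX 4090 (24 GB each), ~2-bit", 6 * 24, 2.0),
    ("4x A100 (80 GB variant assumed), ~4-bit", 4 * 80, 4.0),
    ("4x A100 (80 GB variant assumed), 8-bit", 4 * 80, 8.0),
]

for name, memory_gb, bits in setups:
    verdict = "fits" if fits(memory_gb, bits) else "does not fit"
    print(f"{name}: {weights_gb(bits):.1f} GB of weights, "
          f"{memory_gb} GB available, {verdict}")
```

Under these assumptions the weights alone leave limited headroom on the gaming-GPU option, which is why it is paired with heavy quantization.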

Technology

News:

people eyeballing Mac Studio Thunderbolt clusters on X
yet more progress on clustering in llama.cpp

My bet for local use is Apple CPU / GPU (by whatever name it goes by).
And since this will still be expensive, the rest will run in a datacenter on server-class GPUs or inference chips (not sure what those will look like yet).

* Apple will find a way to compress or store weights in firmware so that you can work with, say, 64GB of RAM.
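
For scale, a quick sketch of what a 64GB budget would imply for a 405B-parameter model. It assumes decimal gigabytes and that all of the memory goes to weights, which is optimistic.

```python
# How many bits per weight fit in a given memory budget?
# Assumptions: decimal GB, every byte is spent on weights.

def max_bits_per_weight(memory_gb: float, params: float = 405e9) -> float:
    """Average bits available per parameter within the memory budget."""
    return memory_gb * 1e9 * 8 / params


budget = max_bits_per_weight(64)
print(f"{budget:.2f} bits per weight")  # roughly 1.26
```

That is well below the 2-bit quantization named in the gaming-GPU answer, so the compression scheme would need to average under 1.3 bits per weight.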

@VishalDoshi Have there been any "texture compression" decoders for LLM weights prototyped?

bought Ṁ10 NO

How many answers are you picking? I’m sure someone somewhere will do each of these.

@MingweiSamuel Whatever looks Pareto-dominant based on vibes from Twitter, /g/, and r/LocalLLaMa. For example, the current 70B meta looks like multiple gaming GPUs or Apple unified memory, with very rare DIY-adaptered A100 frankenracks.

If the community doesn't settle on viable 405B solutions by EoY, everything gets a NO.