When will a GPT-4 class model run on a single consumer PC?
Before 2025: 63%
Before 2026: 80%
Before 2027: 84%
Before 2028: 91%
Before 2029: 93%
Before 2030: 93%

The market will resolve positively when an LLM appears that fulfils all of the following conditions:

  1. It is publicly available for download (for free or for a fee under $1,000).

  2. It can be run on a consumer PC with one CPU and at most one consumer GPU. (It's OK if it requires a high-end model such as an RTX 5090.)

  3. It achieves an average inference speed of at least 60 words per minute on high-end consumer hardware (see the measurement sketch after this list).

  4. It is ranked on par with GPT-4 models on the Chatbot Arena Leaderboard: it should beat at least one version of GPT-4 present in the ranking. If Chatbot Arena no longer exists or doesn't list the model in question, another similar ranking can be used.
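
For illustration only (not part of the resolution procedure), a minimal sketch of how the 60 words-per-minute threshold could be measured; `generate` here is a placeholder for whatever local inference stack is used (llama.cpp, Ollama, etc.), not a specific API:

```python
import time

def words_per_minute(generate, prompt, min_new_words=200):
    """Time a text-generation callable and report its average words per minute.

    `generate` is any function that takes a prompt string and returns the
    generated continuation as a string -- a placeholder, not a specific API.
    """
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    n_words = len(text.split())
    if n_words < min_new_words:
        print(f"warning: only {n_words} words generated; use a longer completion")
    return n_words / elapsed * 60.0

# Example (hypothetical callable):
#   wpm = words_per_minute(my_local_model, "Explain the Elo rating system.")
# The market threshold is wpm >= 60, i.e. one word per second on average.
```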

I do not bet on my own questions.


Looks like this should resolve to "Before 2025"? We now have reports of >1 word/second on a desktop computer.

@nsokolsky I would wait for some proper benchmarks. These are quantized models; I'd assume they have somewhat worse performance.

@Bair https://arxiv.org/pdf/2404.14047.pdf analyzed the quantized models' performance but didn't benchmark against GPT-4.

@nsokolsky @Bair The original Llama-3-70B-Instruct is bfloat16, 140 GB in total. I am not certain how much additional memory you need to actually run inference, but I think it should be about an order of magnitude less than the total size of the weights, so it theoretically should fit in a Mac Studio with 192 GB of unified memory.
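
As a rough sanity check of those numbers (the layer count, KV-head count and head dimension below are taken from the published Llama-3-70B config and should be treated as assumptions; activation buffers are ignored):

```python
# Back-of-the-envelope memory estimate for Llama-3-70B-Instruct in bfloat16.
# Config values (80 layers, 8 KV heads, head dim 128) should be double-checked.

params = 70.6e9              # parameters
bytes_per_param = 2          # bfloat16
weights_gb = params * bytes_per_param / 1e9           # ~141 GB

layers, kv_heads, head_dim, ctx_len = 80, 8, 128, 8192
# K and V caches in bf16 for the full 8K context window:
kv_cache_gb = 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_param / 1e9

print(f"weights:  ~{weights_gb:.0f} GB")
print(f"KV cache: ~{kv_cache_gb:.1f} GB at {ctx_len} tokens")
print(f"total:    ~{weights_gb + kv_cache_gb:.0f} GB of 192 GB unified memory")
```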

I'm not sure whether running the quantized model is faster or slower than the non-quantized one, but a 16-bit model shouldn't be more than 4x slower than a 4-bit model, provided the full weights fit in memory.

So it seems plausible to me that if you can run Q4 at 14 tokens/sec, you should be able to run the non-quantized model at >= 4 tokens/sec.
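
The intuition behind the 4x bound: autoregressive decoding of a dense model is roughly memory-bandwidth bound, since every generated token streams essentially all the weights through memory, so the token rate scales inversely with the model size in bytes. A rough sketch, with the bandwidth figure being an assumed illustrative number rather than a measurement:

```python
# Decode-phase throughput upper bound for a memory-bandwidth-bound dense model:
# every generated token has to read roughly all the weights once.
# Bandwidth is an assumed, illustrative figure (roughly M2 Ultra class), not a measurement.
bandwidth_gb_s = 800
model_sizes_gb = {"bf16 (~140 GB)": 140, "Q4 (~40 GB)": 40}

for name, size_gb in model_sizes_gb.items():
    print(f"{name}: at most ~{bandwidth_gb_s / size_gb:.1f} tokens/sec")

# bf16 is ~3.5x larger than Q4, so at most ~3.5x slower when bandwidth-bound;
# 14 tokens/sec at Q4 is consistent with roughly 4 tokens/sec at bf16.
```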

But, of course, I'll wait for some definitive reports before resolving this.

bought Ṁ30 Before 2026 YES

I think you can run Llama 3 70B on a Mac.

@Sss19971997 I saw a short video on Twitter of Llama 3 running on an M1 Max, and it seemed fast enough, but it was difficult to tell. It might also depend on the context size: it's possible that only the first few tokens are generated at a usable speed.

@traders I don't have a Mac with enough memory, so I can't try it myself. If anyone has one, or knows of a more detailed article or video, I would appreciate a link.

Llama 3 fulfills the conditions for being on par with GPT-4 models according to the LMSYS Leaderboard.

@OlegEterevsky Can we get some clarification on criterion #4? I hadn't seen it when betting (which is definitely my bad!), and I think it's possibly misleading with regard to the title.

GPT-4-0314 and -0613 are already at Elo 1189 and 1163, placing 7th and 10th in the ranking, respectively. My assumption was that it would be enough to be better than these two, not having to beat GPT-4-Turbo-2024... or GPT-4-1106-preview. I personally think that's more representative of being "a GPT-4 class model" than having to beat the SOTA.

Also, Elo scores for the current GPT-4 models are likely to decline further even if the models themselves don't get worse, since newly released, stronger models will beat GPT-4 more often than the current field does. This would happen even if GPT-4's quality stayed exactly the same for the rest of the year.
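
To illustrate the drift with a toy calculation (the classic online Elo update with an assumed K=32; the leaderboard's actual Bradley-Terry fit differs in detail, but the direction of the effect is the same):

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32) -> float:
    """A's new rating after one game (score_a = 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - expected(r_a, r_b))

# A model rated 1189 keeps losing to a hypothetical newer model rated 1300:
r = 1189.0
for _ in range(20):
    r = update(r, 1300.0, score_a=0.0)
print(round(r))  # noticeably below 1189, even though the model itself is unchanged
```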

@HenriThunberg Good point, I'll change criterion #4 to a requirement that the model beats at least one version of GPT-4.

bought Ṁ10 Before 2025 YES

@OlegEterevsky Thanks for considering it 🙌

One can already run huge LLMs on a Mac with 192 GB of unified memory, right?

@Sss19971997 Very slowly. It will not reach 60 words per minute.

@OlegEterevsky You said 60 words per minute, not per second, right? So 1 word per second?

seems doable

@Sss19971997 60 words per minute, right. The GPU in Apple Silicon is not particularly powerful. I haven't personally tested the GPU in M3 processors, but my guess would be that it will be way slower than a token per second.

I'll be happy to be proven wrong.
