Will a GPT4-equivalent model be able to run on consumer hardware before 2024?
resolved Jan 6

"Consumer hardware" is defined as costing no more than $3,000 USD for everything that goes inside the case (not including peripherals).

In terms of "GPT4-equivalent model," I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance. The performance metrics should be within 10% of GPT4's. In the absence of suitable benchmarks I'll make an educated guess come resolution time after consulting educated experts on the subject.

All that's necessary is for the model to run inference, and it doesn't matter how long it takes to generate output so long as you can type in a prompt and get a reply in less than 24 hours. So in the case GPT4's weights are released and someone is able to shrink that model down to run on consumer hardware and get any output at all in less than a day, and the performance of the output meets benchmarks, and it's not 2024 yet, this market resolves YES.

Resolves NO.

All this requires is memory, right? Let's imagine GPT-4 was 1TB of weights, which means ~2 Tflop/token.
The Mac M1 integrated GPU can do like 10 Tflop/s, so that's 300 tokens/minute considering only flops.
The Mac M1 has 400GB/s of memory bandwidth, which limits it to about 24 tokens/minute.
So all you need is 1TB of memory, which would cost $2400 today.

If you used an NVMe SSD instead of RAM, you only have 5GB/s of bandwidth, so only about 0.3 tokens/minute, but that only costs $50!

So it seems guaranteed that you can get completions within a day, even on a $500 budget, let alone $3k.
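
A minimal sketch of the back-of-envelope estimate above, using only the figures quoted in the comment. The decimal units, the 1 byte per parameter implied by "1TB of weights ... ~2Tflop/token", and the idea that every weight is read once per generated token are assumptions, and the function name is made up for illustration.

```python
# Back-of-envelope decode throughput from the figures in the comment above.
# Assumptions (not from the thread): decimal units, one pass over all weights
# per generated token, and no caching, batching, or sparsity.

WEIGHT_BYTES = 1e12      # "1TB of weights"
FLOPS_PER_TOKEN = 2e12   # "~2Tflop/token"


def tokens_per_minute(compute_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Return the slower of the compute-bound and bandwidth-bound rates."""
    compute_bound = compute_flops_per_s / FLOPS_PER_TOKEN    # tokens per second
    bandwidth_bound = bandwidth_bytes_per_s / WEIGHT_BYTES   # tokens per second
    return 60 * min(compute_bound, bandwidth_bound)


ram_rate = tokens_per_minute(10e12, 400e9)  # weights held in unified memory
ssd_rate = tokens_per_minute(10e12, 5e9)    # weights streamed from an NVMe SSD

for name, rate in [("RAM", ram_rate), ("NVMe", ssd_rate)]:
    print(f"{name}: {rate:.1f} tokens/min, {rate * 60 * 24:.0f} tokens/day")
```

Under these assumptions the bandwidth term is the bottleneck in both cases, and even the NVMe case produces output well within a day.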

@TaoLin It also requires you to have access to a GPT-4-equivalent model's weights.

predicted YES

@osmarks the top post doesn't require it to be open source, so if someone chooses to reveal that they can run GPT-4/Gemini/whatever closed-source model on their PC, which they obviously can, that might count

I have Llama 2 7B running on a single GPU with results similar to https://github.com/ggerganov/llama.cpp
So, a few tokens per second. I have no doubt the full Llama 2 70B can run on a single consumer machine to generate 500+ tokens in less than 24h.

@GiovanniRizzi LLaMA 2 is not at all GPT-4-equivalent.

predicted NO

150 Elo per 10x compute.

GPT-4 is 300 Elo ahead.

$100M dense, or ~$3M if everything were done state of the art. And it stacks linearly.
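
Reading the comment above as a log-linear rule of thumb, the compute multiple implied by an Elo gap can be sketched as follows. The 150 Elo per 10x figure and the 300 Elo gap are the commenter's numbers, not measured constants, and the function name is hypothetical.

```python
# Rule of thumb taken from the comment above (an assumption, not a measured law):
# each 10x of training compute buys about 150 Elo.
ELO_PER_10X = 150.0


def compute_multiplier(elo_gap: float) -> float:
    """Compute multiple needed to close an Elo gap under the log-linear rule."""
    return 10 ** (elo_gap / ELO_PER_10X)


# The 300 Elo gap cited for GPT-4 corresponds to two factors of 10.
print(compute_multiplier(300))
```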

You've got Yann LeCun pointing to this paper, claiming GPT4/Bard level performance for LLaMA 65B. That's a fairly good argument for being able to achieve this toward the end of the year because I believe you can already run LLaMA 65B on a tower server that costs less than $3k USD. https://arxiv.org/abs/2305.11206

Not implausible (with an enormous cache, incredibly sparse MoE, and single-digit millions to train)

Just such a brutal architecture to deploy at 0.0X tokens per second that I can't imagine why it would be attempted

All leaderboards point to 2-3 years away

predicted NO
predicted NO
predicted YES

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (May 9)

As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy.

Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
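
A minimal sketch of what an LLM cascade could look like, assuming fixed confidence thresholds rather than the learned strategy the abstract describes. The model functions, the scorer, and all names here are hypothetical stand-ins, not the FrugalGPT implementation or any real API.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: each "model" maps a prompt to an answer, and score()
# estimates how reliable an answer is, in [0, 1]. None of these names come
# from the FrugalGPT paper or from a real LLM API.
Model = Callable[[str], str]


def cascade(prompt: str,
            tiers: List[Tuple[Model, float]],
            score: Callable[[str, str], float]) -> Tuple[str, int]:
    """Try models from cheapest to most expensive and stop at the first answer
    whose score meets that tier's threshold. Returns (answer, tier index)."""
    answer = ""
    for i, (model, threshold) in enumerate(tiers):
        answer = model(prompt)
        if score(prompt, answer) >= threshold:
            return answer, i
    # No tier was confident enough: fall back to the last tier's answer.
    return answer, len(tiers) - 1


# Toy demo with fake models and a fake scorer.
def cheap(prompt: str) -> str:
    return "4" if prompt == "2+2?" else "not sure"


def strong(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


def toy_score(prompt: str, answer: str) -> float:
    return 0.0 if answer == "not sure" else 1.0


tiers = [(cheap, 0.5), (strong, 0.0)]
print(cascade("2+2?", tiers, toy_score))
print(cascade("Capital of France?", tiers, toy_score))
```

The cost saving in a scheme like this comes from how often the cheap tier's answer is accepted, so the result depends heavily on the quality of the scorer.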

@fleventy This isn't relevant. It's just dispatching efficiently to LLM APIs.

predicted YES

@osmarks something like this could be considered an ensemble model (one that has sub-models).

Depends on the definition of model. Single ANNs only, or can it be more sophisticated? @LarsDoucet

@fleventy That does blur the line somewhat but I think it's outside the scope of what I was intending.

100TB of weights and 10PB of cache

Notice that GPT-4 can quote back entire copyrighted books and start buying tape drives

Assuming it's even possible to benchmark GPT4 in the near future, which is doubtful, maybe in 2025... we may already be there, depending on what threshold you accept. https://github.com/manyoso/haltt4llm

predicted YES

@PatrickDelaney when I say "benchmark" in the above comment, I mean run inference on GPT4. Also see my concerns and questions below to Jacob Pfau about OpenAI's problematic habit of using benchmarking metrics in training. That being said, GPT4All and LLaMA already score significantly high.

predicted YES

I'd recommend as benchmarks: HumanEval code top-1, MMLU, and BIG-Bench Hard.

@JacobPfau Correct me if I'm wrong, but from what I have read, OpenAI ignores requests not to train on open datasets, including BIG-bench, so that would be invalid. Further, I'm not sure GPT4 is a model that OpenAI will submit to any leaderboards, as it is proprietary? Lastly, we would have to compile BIG-bench results ourselves based upon the current status of the repo, assuming that a test was even run?

If GPT-4 does some things (like specifically poetry) better, but it's widely understood that the new model is better at basically everything else, and by a margin, and nobody would consider using GPT-4 unless they wanted that niche ability, how would you resolve that?

@YonatanCale "I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance." --> if that condition is satisfied but there's one particular thing that's not well captured by the benchmarks (such as poetry, or performance in rap battles, or coming up with sufficiently delicious cheese soup recipes), that's fine and this still resolves YES.

So it doesn’t matter if the model can fit in VRAM, just that it runs inference on a consumer PC no matter how slow?

@EricG Yep. Just has to run inference on a consumer PC, and return a reasonable length message in less than a day. Run it on CPU if you have to, this market doesn't care.