Will a GPT4-equivalent model be able to run on consumer hardware before 2024?
closes 2024

"Consumer hardware" is defined as costing no more than $3,000 USD for everything that goes inside the case (not including peripherals).

In terms of "GPT4-equivalent model," I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance. The performance metrics should be within 10% of GPT4's. In the absence of suitable benchmarks I'll make an educated guess come resolution time after consulting educated experts on the subject.

All that's necessary is for the model to run inference, and it doesn't matter how long it takes to generate output so long as you can type in a prompt and get a reply in less than 24 hours. So in the case GPT4's weights are released and someone is able to shrink that model down to run on consumer hardware and get any output at all in less than a day, and the performance of the output meets benchmarks, and it's not 2024 yet, this market resolves YES.
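
The resolution rule described above can be sketched as a small check. This is only an illustration under stated assumptions: "within 10%" is read as "at least 90% of GPT4's score on every chosen benchmark", the function name, the benchmark names, and the scores in the usage note are hypothetical placeholders, and the "before 2024" deadline is not modeled.

```python
def resolves_yes(candidate, gpt4, reply_hours, tolerance=0.10, deadline_hours=24.0):
    """Sketch of the market's resolution rule (an interpretation, not official).

    candidate, gpt4: dicts mapping benchmark name -> score (higher is better).
    reply_hours: time the consumer machine needed to answer one prompt.
    """
    # The description allows up to three consensus benchmarks.
    if not gpt4 or len(gpt4) > 3:
        raise ValueError("expected one to three benchmarks")

    # Assumption: "within 10%" means the candidate reaches at least
    # (1 - tolerance) of GPT4's score; a missing benchmark counts as a miss.
    within_tolerance = all(
        candidate.get(name, float("-inf")) >= (1.0 - tolerance) * score
        for name, score in gpt4.items()
    )

    # Speed does not matter as long as a reply arrives in under a day.
    return within_tolerance and reply_hours < deadline_hours
```

With placeholder scores, `resolves_yes({"bench_a": 75.0}, {"bench_a": 80.0}, reply_hours=20.0)` returns `True`, while the same scores with `reply_hours=30.0` return `False`.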

Gigacasting bought Ṁ555 of NO
Gigacasting is predicting NO at 45%

150 Elo per 10x compute.

GPT-4 is 300 Elo ahead.

$100m dense, or ~$3mm if everything were done state of the art. And linearly stacks.
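
Taking the rule of thumb above at face value (150 Elo per 10x compute, a 300 Elo gap), the implied compute gap can be worked out in a couple of lines. A minimal sketch: the function name is hypothetical and the log-linear Elo-to-compute relationship is the commenter's assumption, not an established law.

```python
def compute_multiplier(elo_gap, elo_per_10x=150.0):
    """Compute multiple implied by an Elo gap, assuming each 10x of
    compute buys a fixed number of Elo points (the commenter's heuristic)."""
    return 10.0 ** (elo_gap / elo_per_10x)


# A 300 Elo gap at 150 Elo per 10x is two orders of magnitude.
print(compute_multiplier(300))  # 100.0
```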

Patrick Delaney bought Ṁ10 of YES

You've got Yann LeCun pointing to this paper, claiming GPT4/Bard-level performance for LLaMA 65B. That's a fairly good argument for being able to achieve this toward the end of the year, because I believe you can already run LLaMA 65B on a tower server that costs less than $3k USD. https://arxiv.org/abs/2305.11206

LIMA: Less Is More for Alignment
Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two…
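
Whether a 65B model fits on a sub-$3k machine depends mostly on how much memory its weights need. A rough sketch, assuming about 65 billion parameters and counting weights only (no KV cache, activations, or runtime overhead); the function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: LLaMA 65B has roughly 65 billion parameters.
for bits in (16, 8, 4):
    print(bits, "bits per weight:", weight_memory_gb(65_000_000_000, bits), "GB")
```

At 4 bits per weight this comes to 32.5 GB, and at 16 bits to 130 GB. Whether that lands in RAM, VRAM, or on disk is a separate question that this estimate does not address.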
Gigacasting bought Ṁ0 of NO

Not implausible (with an enormous cache, an incredibly sparse MoE, and single-digit millions to train).

It's just such a brutal architecture to deploy at 0.0X tokens per second that I can’t imagine why it would be attempted.

All leaderboards point to 2-3 years away.

Gigacasting is predicting NO at 47%
fleventy is predicting YES at 37% (edited)

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (May 9)

As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy.

Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.

osmarks

@fleventy This isn't relevant. It's just dispatching efficiently to LLM APIs.

Gigacasting bought Ṁ180 of NO

100TB of weights and 10PB of cache

Notice that GPT-4 can quote back entire copyrighted books and start buying tape drives

Patrick Delaney bought Ṁ40 of YES

Assuming it's even possible to benchmark GPT4 in the near future, which is doubtful (maybe in 2025), we may already be there, depending on what threshold you accept. https://github.com/manyoso/haltt4llm

Patrick Delaney is predicting YES at 51%

@PatrickDelaney When I say "benchmark" in the above comment, I mean run inference on GPT4. Also see my concerns and questions below to Jacob Pfau about OpenAI's problematic habit of using benchmarking metrics in training. That being said, GPT4All and LLaMA already score significantly high.

Jacob Pfau

I'd recommend as benchmarks: HumanEval (code, top-1), MMLU, and BIG-Bench Hard.

Patrick Delaney

@JacobPfau Correct me if I'm wrong, but from what I have read, OpenAI ignores requests not to train on open datasets, including BIG-bench, so that would be invalid. Further, I'm not sure OpenAI will submit GPT4 to any leaderboards, as it is proprietary. Lastly, wouldn't we have to compile the BIG-bench results ourselves based on the current status of the repo, assuming that a test was even run?

Yonatan Cale bought Ṁ2 of YES

If GPT-4 does some things (like specifically poetry) better, but it's widely understood that the new model is better at basically everything else, and by a margin, and nobody would consider using GPT-4 unless they wanted that niche ability, how would you resolve that?

Lars Doucet

@YonatanCale "I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance." --> if that condition is satisfied but there's one particular thing that's not well captured by the benchmarks (such as poetry, or performance in rap battles, or coming up with sufficiently delicious cheese soup recipes), that's fine and this still resolves YES.

EricG

So it doesn’t matter if the model can fit in VRAM, just that it runs inference on a consumer PC no matter how slow?

Lars Doucet

@EricG Yep. Just has to run inference on a consumer PC, and return a reasonable length message in less than a day. Run it on CPU if you have to, this market doesn't care.
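
To get a feel for how slow "less than a day" allows inference to be, the minimum generation speed can be estimated as follows. A back-of-the-envelope sketch: the function name is hypothetical, and the reply length is an assumed stand-in for "a reasonable length message" (prompt processing time is ignored).

```python
def min_tokens_per_second(reply_tokens, deadline_hours=24.0):
    """Slowest average generation speed that still finishes a reply
    of `reply_tokens` tokens within the deadline."""
    return reply_tokens / (deadline_hours * 3600.0)


# Assumption: an 864-token reply, chosen only to make the arithmetic round.
print(min_tokens_per_second(864))  # 0.01 tokens per second
```

So a model crawling along at roughly a hundredth of a token per second would still qualify under this reading.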

Will Janzen bought Ṁ20 of YES

Maybe I'm way off here... but I thought most of the computing resources go into training the model, and then it's much less computationally expensive to run, though I know GPT-4 is huge. I guess this is largely contingent on chip prices, right?



EricG

@WillJanzen Just curious, this Medium article says GPT4 has 170 trillion parameters. Do you know where it gets that information? I haven’t kept up to date with the rumor mill, but that strikes me as unlikely.

osmarks

@WillJanzen We have FlexGen and stuff and the time limit on this is rather long, so if a GPT-4-equivalent model was available and not ridiculously large this would be satisfied.
