"Consumer hardware" is defined as costing no more than $3,000 USD for everything that goes inside the case (not including peripherals).
In terms of "GPT4-equivalent model," I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance. The performance metrics should be within 10% of GPT4's. In the absence of suitable benchmarks, I'll make an educated guess come resolution time after consulting knowledgeable experts on the subject.
All that's necessary is for the model to run inference, and it doesn't matter how long it takes to generate output so long as you can type in a prompt and get a reply in less than 24 hours. So in the case GPT4's weights are released and someone is able to shrink that model down to run on consumer hardware and get any output at all in less than a day, and the performance of the output meets benchmarks, and it's not 2024 yet, this market resolves YES.
150 Elo per 10x compute.
GPT-4 is 300 Elo ahead.
$100M dense, or ~$3M if everything were done state of the art. And it stacks linearly.
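Taking the two figures above at face value (about 150 Elo per 10x compute, and a 300 Elo gap), the implied compute gap is 10^(300/150) = 100x. A minimal sketch of that arithmetic, assuming the Elo gain per order of magnitude of compute stays constant (an assumption, not something the thread establishes):

```python
def compute_multiplier(elo_gap: float, elo_per_10x: float = 150.0) -> float:
    """Compute multiple needed to close an Elo gap.

    Assumes a fixed number of Elo points is gained per 10x increase in
    compute, so the gap equals elo_gap / elo_per_10x orders of magnitude.
    """
    return 10 ** (elo_gap / elo_per_10x)


print(compute_multiplier(300))
```

With the default 150 Elo per 10x, a 300 Elo gap works out to 100x compute.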
You've got Yann LeCun pointing to this paper, claiming GPT4/Bard level performance for LLaMA 65B. That's a fairly good argument for being able to achieve this toward the end of the year, because I believe you can already run LLaMA 65B on a tower server that costs less than $3k USD. https://arxiv.org/abs/2305.11206
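Whether LLaMA 65B fits on a sub-$3k tower mostly comes down to memory. A back-of-the-envelope sketch of weight memory at a few quantization levels. The bit widths are illustrative, the parameter count is the nominal 65 billion, and the KV cache and runtime overhead are ignored, so real usage will be somewhat higher:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_param / 8 / 1e9


LLAMA_65B_PARAMS = 65e9  # nominal parameter count

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(LLAMA_65B_PARAMS, bits):.1f} GB")
```

At 4-bit the weights come to roughly 32.5 GB, which fits in 64 GB of system RAM, while 16-bit needs about 130 GB.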
Not implausible (with an enormous cache, incredibly sparse MoE, and single-digit millions to train)
It’s just such a brutal architecture to deploy, at 0.0X tokens per second, that I can’t imagine why anyone would attempt it.
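The market's 24-hour rule sets how slow is too slow. A rough sketch of the longest reply that still arrives within a day at a steady generation rate. The rates used are hypothetical values in the 0.0X tokens-per-second range, not measurements:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # the market's 24-hour reply limit


def max_reply_tokens(tokens_per_second: float) -> int:
    """Longest reply (in tokens) that finishes within 24 hours at a steady rate."""
    return int(tokens_per_second * SECONDS_PER_DAY)


# Hypothetical rates in the "0.0X tokens per second" range.
for rate in (0.01, 0.05):
    print(f"{rate} tok/s -> {max_reply_tokens(rate)} tokens in 24 h")
```

For example, a 500-token reply at 0.01 tokens per second takes 50,000 seconds (about 13.9 hours), which is slow but still inside the limit.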
All leaderboards point to 2-3 years away.
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (May 9)
As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy.
Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
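The cascade idea quoted above can be illustrated with a fixed-order sketch: query cheaper models first and fall back to pricier ones only when a scorer rejects the answer. This is not FrugalGPT's implementation; in the paper the scorer and the model ordering/thresholds are learned, whereas here the model names, costs, thresholds, and the toy scorer are all hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str                         # hypothetical model name
    cost: float                       # hypothetical cost per query
    generate: Callable[[str], str]    # stand-in for a real LLM API call
    threshold: float                  # accept the answer if its score >= this


def cascade(query: str, stages: List[Stage],
            score: Callable[[str, str], float]) -> Tuple[str, str, float]:
    """Query models cheapest-first and stop at the first answer the scorer
    accepts. The last stage's answer is returned if none is accepted.
    Returns (model name, answer, total cost spent)."""
    used, answer, total = "", "", 0.0
    for stage in stages:
        answer = stage.generate(query)
        total += stage.cost
        used = stage.name
        if score(query, answer) >= stage.threshold:
            break
    return used, answer, total


# Toy setup: a cheap model with a strict threshold, then an expensive fallback.
stages = [
    Stage("cheap-model", 0.001, lambda q: "42", 0.9),
    Stage("expensive-model", 0.03, lambda q: "42, because 6 * 7 = 42", 0.0),
]


def toy_score(query: str, answer: str) -> float:
    # Purely illustrative: trusts longer answers more.
    return min(len(answer) / 10, 1.0)


print(cascade("What is 6 * 7?", stages, toy_score))
```

Here the cheap model's short answer scores below its threshold, so the query escalates and the total cost includes both calls; with a scorer that accepts the first answer, only the cheap model would be paid for.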
100TB of weights and 10PB of cache
Notice that GPT-4 can quote back entire copyrighted books and start buying tape drives
Assuming it's even possible to benchmark GPT4 in the near future, which is doubtful, maybe in 2025 or 2025... we may already be there, depending on what threshold you accept. https://github.com/manyoso/haltt4llm
@PatrickDelaney when I say benchmark in the above comment, I mean run inference on GPT4. Also see my concerns and questions below to Jacob Pfau about OpenAI's problematic habit of using benchmarking metrics in training. That being said, GPT4All and LLaMA already score quite high.
I'd recommend as benchmarks: HumanEval code top-1, MMLU, and BIG-Bench Hard.
@JacobPfau Correct me if I'm wrong, but from what I have read, OpenAI ignores requests not to train on open datasets, including BIG-bench, so that benchmark would be invalid. Further, I'm not sure GPT4 is an inference model that OpenAI will submit to any leaderboards, as it is proprietary. Lastly, wouldn't we have to compile BIG-bench results ourselves based on the current state of the repo, assuming a test was even run?
If GPT-4 does some things (specifically poetry, say) better, but it's widely understood that the new model is better at basically everything else, and by a margin, and nobody would consider using GPT-4 unless they wanted that niche ability, how would you resolve that?
@YonatanCale "I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance." --> if that condition is satisfied but there's one particular thing that's not well captured by the benchmarks (such as poetry, performance in rap battles, or coming up with sufficiently delicious cheese soup recipes), that's fine and this still resolves YES.
So it doesn’t matter if the model can fit in VRAM, just that it runs inference on a consumer PC no matter how slow?
@EricG Yep. Just has to run inference on a consumer PC, and return a reasonable length message in less than a day. Run it on CPU if you have to, this market doesn't care.
Maybe I'm way off here... but I thought most of the computing resources go into training the model, and then it's much less computationally expensive to run, though I know GPT-4 is huge. I guess this is largely contingent on chip prices, right?
@WillJanzen Just curious: this Medium article says GPT4 has 170 trillion parameters. Do you know where it gets that information? I haven’t kept up to date with the rumor mill, but that strikes me as unlikely.
@WillJanzen We have FlexGen and stuff and the time limit on this is rather long, so if a GPT-4-equivalent model was available and not ridiculously large this would be satisfied.
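FlexGen's relevance is that weights can be offloaded to CPU RAM or disk when they don't fit in GPU memory. A naive lower bound, not FlexGen's actual scheduling (which batches requests to amortize I/O): at batch size 1, a dense model must read every weight once per generated token, so seconds per token is roughly model bytes divided by read bandwidth. The 65B parameter count, 4-bit width, ~3 GB/s NVMe read speed, and 500-token reply below are all illustrative assumptions:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # the market's 24-hour reply limit


def seconds_per_token(n_params: float, bits_per_param: int,
                      read_gb_per_s: float) -> float:
    """Lower-bound latency per generated token when every weight is
    streamed from storage once per token (dense model, batch size 1)."""
    model_bytes = n_params * bits_per_param / 8
    return model_bytes / (read_gb_per_s * 1e9)


def fits_in_a_day(n_tokens: int, sec_per_token: float) -> bool:
    """True if a reply of n_tokens finishes inside the 24-hour limit."""
    return n_tokens * sec_per_token < SECONDS_PER_DAY


# Illustrative assumptions: 65B parameters at 4-bit, streamed from an
# NVMe drive reading about 3 GB/s, producing a 500-token reply.
spt = seconds_per_token(65e9, 4, 3.0)
print(f"{spt:.1f} s/token, fits in a day: {fits_in_a_day(500, spt)}")
```

Under these assumptions each token takes roughly 11 seconds, so a 500-token reply needs about an hour and a half, well inside the limit; the bound scales linearly with model size, which is why "not ridiculously large" matters.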