"Consumer hardware" is defined as costing no more than $3,000 USD for everything that goes inside the case (not including peripherals).
In terms of "GPT4-equivalent model," I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance. The performance metrics should be within 10% of GPT4's. In the absence of suitable benchmarks I'll make an educated guess come resolution time after consulting educated experts on the subject.
All that's necessary is for the model to run inference; it doesn't matter how long it takes to generate output so long as you can type in a prompt and get a reply in less than 24 hours. So if GPT4's weights are released, someone shrinks that model down to run on consumer hardware, it produces any output at all in under a day, that output meets the benchmarks, and it's not yet 2024, this market resolves YES.
All this requires is memory, right? Let's imagine GPT-4 is 1 TB of weights, which means ~2 Tflop/token.
The Mac M1 integrated GPU can do roughly 10 Tflop/s, so that's 300 tokens/minute considering flops alone.
The Mac M1 has 400 GB/s of memory bandwidth, which limits it to ~24 tokens/minute.
So all you need is 1TB of memory, which would cost $2400 today.
If you used an NVMe SSD instead of RAM, you only have 5 GB/s of bandwidth, so roughly 0.3 tokens/minute (about 400 tokens a day), and 1 TB of SSD only costs $50!
So it seems guaranteed that you can get completions within a day, even on a $500 budget, let alone $3k.
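A quick sanity check of that arithmetic (the 1 TB weight size and ~2 Tflop/token are the assumptions above, not known GPT4 figures):

```python
# Back-of-envelope throughput for a hypothetical 1 TB model.
# Hardware numbers are the rough figures quoted above, not measurements.

WEIGHTS_BYTES = 1e12    # assumed model size: 1 TB of weights
FLOPS_PER_TOKEN = 2e12  # ~2 Tflop/token, per the estimate above

def tokens_per_minute(flops_per_sec, mem_bytes_per_sec):
    # A token can't finish faster than either the compute or the
    # weight-streaming allows; take the slower of the two limits.
    compute_limit = flops_per_sec / FLOPS_PER_TOKEN
    memory_limit = mem_bytes_per_sec / WEIGHTS_BYTES
    return 60 * min(compute_limit, memory_limit)

print(tokens_per_minute(10e12, 400e9))  # M1-class: ~24 tok/min (memory-bound)
print(tokens_per_minute(10e12, 5e9))    # NVMe SSD: ~0.3 tok/min (~430 tok/day)
```

Either way the bottleneck is streaming the weights, not compute, and even the SSD case clears the 24-hour bar.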
@osmarks the top post doesn't require it to be open source, so if someone chooses to reveal that they can run GPT4/Gemini/whatever closed-source model on their PC, which they obviously can, that might count
I have llama2 7B running on a single GPU with results similar to https://github.com/ggerganov/llama.cpp
So, a few tokens per second. I have no doubt the full llama2 70B can run on a single consumer machine and generate 500+ tokens in less than 24h.
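For reference, here's a minimal local-inference sketch using the llama-cpp-python bindings to the llama.cpp project linked above; the model file name and quantization level are placeholders, not a specific recommendation:

```python
# Minimal local inference via llama-cpp-python (pip install llama-cpp-python).
# The model path/quantization are placeholders; any quantized LLaMA
# checkpoint that fits in RAM should behave similarly.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What limits local LLM inference speed? A:", max_tokens=128)
print(out["choices"][0]["text"])
```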
You've got Yann LeCun pointing to this paper, claiming GPT4/Bard-level performance for LLaMA 65B. That's a fairly good argument for being able to achieve this toward the end of the year, because I believe you can already run LLaMA 65B on a tower server that costs less than $3k USD. https://arxiv.org/abs/2305.11206
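For a rough sense of why that fits the budget, here's weights-only napkin math for a 65B-parameter model (quantization levels are illustrative, and runtime overhead like the KV cache is ignored):

```python
# Approximate RAM needed just to hold LLaMA 65B's weights.
PARAMS = 65e9
for name, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name}: ~{PARAMS * bytes_per_param / 1e9:.0f} GB")
# fp16: ~130 GB, 8-bit: ~65 GB, 4-bit: ~33 GB.
# A 4-bit quantized model fits in 64 GB of desktop RAM, which is
# comfortably inside a $3k tower budget.
```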
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (May 9)
As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy.
Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
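The cascade idea is roughly: answer with a cheap model first and only escalate to a more expensive one when a learned scorer isn't confident in the answer. A minimal sketch of that control flow (the model list, scorer, and thresholds are illustrative, not FrugalGPT's actual implementation):

```python
# Illustrative LLM cascade in the spirit of FrugalGPT. The generate
# functions, confidence scorer, and thresholds are all placeholders.
from typing import Callable, List, Tuple

def cascade(query: str,
            models: List[Tuple[Callable[[str], str], float]],
            score: Callable[[str, str], float]) -> str:
    """models: (generate_fn, acceptance_threshold) pairs, cheapest first.
    score: a learned scorer estimating whether an answer is good enough."""
    answer = ""
    for generate, threshold in models:
        answer = generate(query)
        if score(query, answer) >= threshold:
            return answer  # cheap tier was confident enough; stop here
    return answer  # fall through to the most expensive model's answer
```

The cost savings come from most queries stopping at the cheap tier; only the hard ones pay GPT-4 prices.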
@osmarks something like this could be considered an ensemble model (one that has sub-models).
Depends on the definition of model. Single ANNs only, or can it be more sophisticated? @LarsDoucet
@fleventy That does blur the line somewhat but I think it's outside the scope of what I was intending.
Assuming it's even possible to benchmark GPT4 in the near future, which is doubtful, maybe in 2024 or 2025... we may already be there, depending on what threshold you accept. https://github.com/manyoso/haltt4llm
@PatrickDelaney when I say "benchmark" in the above comment, I mean run an inference of GPT4. Also see my concerns and questions below to Jacob Pfau about OpenAI's problematic habit of using benchmarking metrics in training. That being said, GPT4All and LLaMA already score significantly high.
@JacobPfau Correct me if I'm wrong, but from what I have read, OpenAI ignores requests not to train on open datasets, including BIG-bench, so that would be invalid. Further, I'm not sure GPT4 is a model OpenAI will submit to any leaderboards, as it is proprietary. Lastly, we would have to compile BIG-bench results ourselves based on the current status of the repo, assuming a test was even run.
@YonatanCale "I'll go with whatever popular consensus seems to indicate the top benchmarks (up to three) are regarding performance." --> if that condition is satisfied but there's one particular thing that's not well captured by the benchmarks (such as poetry, or performance in rap battles, or coming up with sufficiently delicious cheese soup recipes), that's fine an this still resolves YES.
@EricG Yep. It just has to run inference on a consumer PC and return a reasonable-length message in less than a day. Run it on CPU if you have to; this market doesn't care.