Will it be possible to run an LLM of GPT-4 (or higher) capability on a portable device by 2027?
Resolved YES (Aug 18)

By portable, I mean under 3.6 kg (8 pounds). The device should be commercially available.

  • Update 2025-08-06 (PST) (AI summary of creator comment): The creator will wait a few days before resolving, as independent benchmarks appear to show significantly worse performance than reported in the model card for the proposed qualifying model.



@sortwie Resolves as YES. OpenAI's gpt-oss-20b fits the bill (model card https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf)

Using ollama or llama.cpp, you can easily run it on a laptop faster than the original GPT-4, even without a GPU. A GPU is significantly faster, though (I get 200 tokens per second on my RTX 4090).
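A minimal sketch of the ollama route, assuming the local ollama server is running and the model has already been pulled (the `gpt-oss:20b` tag is an assumption; check the ollama model library for the exact name):

```python
# Sketch: query a locally served gpt-oss-20b via the ollama Python client.
# Assumes `pip install ollama` and a prior `ollama pull gpt-oss:20b`.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # assumed model tag
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
print(response["message"]["content"])
```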

As for capabilities: gpt-oss-20b destroys GPT-4 in a direct comparison. It's not even close. GPT-4 still occasionally struggled with primary-school math; gpt-oss-20b aces competition math and programming while performing at PhD level on GPQA (page 10 of the model card, see screenshot below).

@ChaosIsALadder I'm inclined to agree, but I'll wait a few days. Independent benchmark results appear to be significantly worse than those reported in the model card.

@sortwie Yeah, please do wait to make sure OpenAI isn't pulling a fast one. A personal anecdote: I was testing the model for work and also thought it was less smart than it should be, until I realized you have to put "Reasoning: high" in the system prompt. I suspect that's why people get worse results than those in the model card; no model needed this until now, so they may simply have forgotten to set it. I think this is also why, e.g., https://artificialanalysis.ai/models/gpt-oss-120b rates gpt-oss a bit worse than Qwen3: when you scroll down, you can see it uses far fewer tokens than Qwen.

At least in our internal testing, gpt-oss always scored at least as well as Qwen3 and occasionally better, but only after I set the reasoning effort to high. In any case, even without that setting, gpt-oss still wipes the floor with the original GPT-4.
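A sketch of the setting described above, again via the ollama Python client; whether a bare "Reasoning: high" system message is exactly what a given provider expects is an assumption, the point is only that reasoning effort has to be requested explicitly:

```python
# Sketch: request high reasoning effort from gpt-oss via a system prompt,
# as described in the comment above. Model tag and prompt form are assumptions.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # assumed model tag
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the sum of two odd numbers is even."},
    ],
)
print(response["message"]["content"])
```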

@sortwie Artificial Analysis just improved their score for gpt-oss from 58 to 61, now clearly rating it above Claude 4 Sonnet Thinking, GPT-4.1, and DeepSeek R1 0528. As I said, setting up gpt-oss correctly turns out to be non-trivial, to the point that Artificial Analysis now has a benchmark for how good a job each provider did with the setup. There's quite a lot of variation, which explains the occasional reports of worse-than-expected performance (https://nitter.net/ArtificialAnlys/status/1955102409044398415).

In any case, it's very clear that gpt-oss-120b beats the hell out of GPT-4.1, which in turn is clearly better than GPT-4o / GPT-4 (https://openai.com/index/gpt-4-1).

How is this different from "will this device exist"?
