MANIFOLD
Will a language model that runs locally on a consumer cellphone beat GPT4 by EOY 2026?
Resolved YES on Mar 16

GPT-4-0314

For the locally run model, we consider the language model alone, not augmented with search, RAG, or function calling. It needs a minimum throughput of 4 tokens/second.
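The 4 tokens/second criterion can be checked with a simple timing loop. A minimal sketch (the function names and the dummy generator are mine, standing in for a real on-device decode loop):

```python
import time

def measure_throughput(generate_tokens, n_tokens=256):
    """Measure decode throughput in tokens/second.

    `generate_tokens` is any callable that yields tokens one at a time;
    here it is a placeholder for a real model's decode loop.
    """
    start = time.perf_counter()
    count = 0
    for _ in generate_tokens(n_tokens):
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed

# Dummy generator standing in for a real model (hypothetical timing).
def fake_model(n):
    for i in range(n):
        time.sleep(0.001)  # pretend each token takes ~1 ms to decode
        yield i
```

In practice one would time a few hundred decode steps after the prompt is processed, since prefill speed and decode speed differ.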

Not sure what benchmarks people will use in 2026, but let's say LMSYS Arena for the moment. This may change depending on the trend.

Current SOTA:

I am not sure Phi-3 (3.8B) can fit on a phone. If not, the current best are MiniCPM and Gemma 2B.

  • Update 2026-03-15 (PST) (AI summary of creator comment): The creator is resolving YES based on Gemma 3n being a phone-runnable model with an LMArena rating of 1308, which beats GPT-4-0314's rating of 1207.


🏅 Top traders

#   Trader   Total profit
1            Ṁ460
2            Ṁ126
3            Ṁ81
4            Ṁ63
5            Ṁ57

Gemma 3n is a model runnable on a phone with an LMArena rating of 1308, beating GPT-4-0314's rating of 1207. I am resolving YES.

@Sss19971997 hmmm interesting... my intuition would be that there's rating inflation in all Elo systems as more models are released! You can't just compare an Elo rating now with one from 2 years ago... but LMArena implies that the old rating is still active? Even though it's likely that no one has rated GPT-4 in years.

@bens The margin is big enough.

@bens No, why would there be inflation? It's Elo; it's relative, so there's no such thing as inflation.
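For reference, here is the classic Elo update, which illustrates the "relative" point: with a shared K-factor, the two players' rating changes in any game are equal and opposite, so a game conserves total rating within the pool. This is a sketch of textbook Elo (function names are mine); note that LMArena's leaderboard is actually computed with a Bradley-Terry-style model, not this exact sequential update.

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b) after one game.

    score_a is 1 for an A win, 0.5 for a draw, 0 for a loss.
    The two deltas are k*(s - e) and k*((1-s) - (1-e)), which sum to zero.
    """
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))
```

Zero-sum updates rule out inflation from any single game, though the debate upthread is about a subtler effect: whether the meaning of an old, inactive rating drifts as the pool of opponents changes.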

Gemma 2 9B IT is already higher on LMSYS Arena, though only by 2 points.

Update: oops, sorry, I was thinking the criterion was <10B; confused this with a different question.

bought Ṁ20 NO

I'm guessing this means any consumer cellphone? E.g. if a model that fits in 32GB RAM beats GPT-4 and there's only 1-2 phones with that much RAM in 2026 (current record is 24GB), this resolves Yes.

@JoshYou yes. Any consumer cellphone

bought Ṁ10 NO

Runs at what rate? If it's one token per minute, does it count?

@0482 Let’s say 4 tokens/s

If they are allowed to browse the internet, then for sure. If we are talking about encoding all the knowledge, then probably not.

@Magnus_ Great question...

How should we specify this? I am thinking it can do RAG over everything stored on the phone, but with no internet connection. What do you think?

Currently, a phone can hold up to 512GB. That is a lot of info, but not the whole internet.

This criterion captures the "local" aspect.

@Magnus_ Another option is to say language model only, no local RAG

@Magnus_ I thought about it again. For a fair comparison, we should hold the local mobile LM to the same standard, since GPT-4 does not use any RAG, search, or tools. I have updated the criterion.

© Manifold Markets, Inc. · Terms · Privacy