Will a new lab create a top-performing AI frontier model before 2028?

Criteria for Resolution:

1. Definition of "New Lab":

- A "new lab" refers to any company, lab, or other entity that is not OpenAI, Anthropic, DeepMind, Google, Meta, Mistral, xAI, Microsoft, Nvidia, or any subsidiary or parent company of these.

2. Top-Performing Generally Capable AI Frontier Model:

- The AI frontier model must achieve at least a robust second place in performance. Qualifying outcomes include:

  - Unambiguous first place.
  - Unambiguous second place.
  - Ambiguous first place.
  - Sharing first place.

- Sharing second place does not qualify.

3. Performance Metrics:

- Performance will be judged using the most widely accepted metrics and user evaluations available at that time.

- For example, metrics may include benchmarks such as MMLU, HumanEval, and other relevant AI performance benchmarks.


Looking through the LMSYS leaderboard, you might be missing at least: 01 AI, Alibaba, Cohere, Nvidia, Reka AI, Zhipu AI.

Thanks. I will add Nvidia. With that, the list will remain fixed.

The compute cost of training a cutting-edge model is currently in the hundreds of millions of dollars. Epoch estimates it will continue to grow by 0.2 OOM (orders of magnitude) each year.

That's without accounting for human capital costs. Training a cutting-edge model requires a bunch of engineering schlep, which means hiring some world-class people.

You need to have both deep pockets and a strong motivation to start an AI lab for this to make sense. So maybe a national government?
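To make the growth rate concrete, here is a minimal sketch of what 0.2 OOM/year compounding looks like. The $300M starting cost and the 2024 base year are illustrative assumptions, not figures from this thread; only the 0.2 OOM/year rate comes from the comment above.

```python
def projected_cost(base_cost_usd, base_year, target_year, oom_per_year=0.2):
    """Extrapolate a cost that grows by a fixed number of orders of magnitude per year."""
    years = target_year - base_year
    return base_cost_usd * 10 ** (oom_per_year * years)

# Hypothetical projection: $300M in 2024, growing 0.2 OOM/year.
for year in range(2024, 2029):
    cost = projected_cost(300e6, 2024, year)
    print(f"{year}: ~${cost / 1e6:,.0f}M")
```

Under these assumptions the cost roughly quadruples by 2027 (10^0.6 ≈ 3.98x), putting a frontier training run above a billion dollars before the question's deadline.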

what about Microsoft?

Yes, I think it is reasonable to add Microsoft to the list.

Is this only about language models? Or do video generation models matter as well?

It must be considered a general-purpose model with general capabilities. A video generation model can in principle be in this class: if a capable video generation model can be applied to various tasks and demonstrates strong intelligence, it will qualify. If, for example, it is merely the best at generating aesthetically beautiful short videos or producing advertisements, it will not qualify.

IMO, action-conditioned video prediction models could be considered "general" forward dynamics models/world models. But I see how your definition is more (text-based) task-bound. Thanks for clarifying.

bought Ṁ100 NO

If it is a music production model, does it count, or does it have to be a general-purpose language or multimodal model?

It must be a general-purpose model.

Foundational models suck at benchmarks, I thought? Like, they need to be fine-tuned / RLHFed to reliably answer the questions they are asked? Not sure, though. I haven't seen OpenAI or Anthropic give benchmark results for, or access to, their foundational models.

I mean, of course, fine-tuned versions of the models. By "foundational" I mean generally capable large frontier models, not necessarily non-RLHFed ones.

Replaced "foundational" with "frontier", which is maybe less confusing.