Will there be an AI language model that strongly surpasses ChatGPT and other OpenAI models before the end of 2024?
💎 Premium · 735 · Ṁ280k · Dec 31 · 16% chance

Question is about any models by competitors vs any current or future openAI models.

To surpass ChatGPT, it cannot just be more popular. If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided at least some members of the public have access).

If there is dispute as to which is more powerful, a significant popularity/accessibility advantage will decide the winner. There must be public access for it to be eligible.

Explaining the three main metrics for assessment:

1. Popularity and Accessibility

Assessed in real terms. Total users and engagement (if public), otherwise I will defer to Google Trends or other publicly available data which can provide relative measures of popularity.

This metric is arguably the most important, but it is not the only factor: if an AI gets pushed to an existing OS such as Android or iOS, the existing assistant leaders in those areas (e.g. Siri, Google Assistant) will have a meaningful advantage, irrespective of product quality.

The popularity and userbase elements are essentially being used as a substitute for "usefulness".

2. Accuracy and Reliability

Assessed against a common set of questions or data, to be determined. I expect there will be studies or other academic materials published which I hope to refer to. If these papers conclude that "ExampleRacistBot" is more accurate than chatGPT because of censoring, etc. then "ExampleRacistBot" would be the more accurate model.

3. Power & Capability

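Total computational ability, based on whichever metrics seem most relevant at the time. Right now those might be:

  • Number of parameters

  • Tokens/words processed

  • Problem-solving abilities

  • Trainability (an example of an ongoing trainability exercise with GPT4: @/Mira/will-a-prompt-that-enables-gpt4-to)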

If there are any recommendations for Power & Capability metrics, please let me know. The biggest issue I see is that if I establish which metrics are most important (say, number of parameters, like everyone did for GPT3) we may end up in a situation like the one we are in now, where GPT4 parameters are not disclosed. Expect these criteria to change over time.

Resolution Assessment

When the final assessment is being made, if there aren't reliable competitive grounds for comparison, the following series of checks is how I see the decision-making process going.

First check: Popularity. If one model is evidently significantly more popular (like >50% market share, or more popular by Google Trends metrics, etc.), then it will be considered the best unless there is evidence suggesting an alternative public AI is more accurate and powerful. If multiple models are within a similar frame of power and accuracy, popularity will determine the winner.

Second check: If there are multiple models of similar popularity, accuracy and power will determine the winner. If there are no good academic comparisons for accuracy, I will endeavour to conduct my own, but I'm really hoping someone else figures that out before I have to. Accuracy seems more tied to "usefulness" than power, but if there is a significant breakthrough in power such that one has a capability advantage over the other (like advanced logic problem solving) while maintaining accuracy, and for whatever reason it's not more popular but still public, that will win.

Third check: If there are multiple models of comparable popularity (or there is terrible data available) and there is no clear difference in capability, power, or accuracy, the decision will be deferred to more specific considerations, like a Google Trends comparison (relative search popularity) or the number of parameters (provided this is public, and still a respected metric).

If the decision becomes highly subjective, I will defer judgement to someone I deem to be an expert (and have reasonable access to) who will make the final call. I'll probably just email professors until I get a response, or ask someone enrolled in a related course at the time to ask their professor to respond.
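
For anyone who finds the checks easier to follow as code, here is a rough sketch of how they might fit together. It is an illustration under assumptions, not the actual resolution method: the `Model` fields, the 0-to-1 `tech_score`, and both margin constants are hypothetical, since the description never quantifies "similar" or "significant". The ordering follows the rule that a clear technical lead wins regardless of popularity, and that popularity decides only when the technical comparison is disputed.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds: the description never quantifies "similar" or
# "significant", so both margins below are assumptions for illustration.
TECH_MARGIN = 0.10   # assumed gap at which one model is clearly better technically
SHARE_MARGIN = 0.10  # assumed gap at which one model is clearly more popular


@dataclass
class Model:
    name: str
    is_openai: bool
    public: bool         # at least some members of the public have access
    tech_score: float    # hypothetical combined accuracy/power score in [0, 1]
    market_share: float  # hypothetical share of users in [0, 1]


def resolve(models: List[Model]) -> str:
    """Return "YES", "NO", or "DEFER" (tie-breakers or an expert decide)."""
    eligible = [m for m in models if m.public]  # public access is required
    rivals = [m for m in eligible if not m.is_openai]
    openai = [m for m in eligible if m.is_openai]
    if not rivals:
        return "NO"

    best_rival_tech = max(m.tech_score for m in rivals)
    openai_tech = max((m.tech_score for m in openai), default=0.0)
    openai_share = sum(m.market_share for m in openai)

    # A clear technical lead wins regardless of popularity.
    tech_gap = best_rival_tech - openai_tech
    if tech_gap >= TECH_MARGIN:
        return "YES"
    if tech_gap <= -TECH_MARGIN:
        return "NO"

    # Technically disputed: popularity among the competitive rivals decides.
    contenders = [m for m in rivals if openai_tech - m.tech_score < TECH_MARGIN]
    share_gap = max(m.market_share for m in contenders) - openai_share
    if share_gap >= SHARE_MARGIN:
        return "YES"
    if share_gap <= -SHARE_MARGIN:
        return "NO"

    # Nothing separates them: fall back to tie-breakers or an expert.
    return "DEFER"


# Made-up numbers: a technically close rival with a much smaller user base.
example = [
    Model("openai-model", True, True, 0.80, 0.65),
    Model("rival-model", False, True, 0.82, 0.10),
]
print(resolve(example))  # NO
```

The "DEFER" branch stands in for the third check and the expert fallback, which depend on judgement calls (Google Trends, parameter counts) that a score cannot capture.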

[ Taking advice for updates or any proposed criteria changes ]

Competing for @AmmonLam's subsidy

[Changelog]

01/05/2023: Description updated to reflect thoughts expanded on in comments


Is the new (upgraded) Claude 3.5 Sonnet, and/or GPT-o1, considered sufficient to YES-resolve this prompt?

bought Ṁ50 YES

Does it have to be from someone besides OpenAI?

@Ernie “and other OpenAI models” implies yes to this I think

bought Ṁ500 NO

I would love to know how this is looking in the eyes of the judges.

To my reading... many models are "as good" as ChatGPT but none are clearly better on technical comparisons... and none are nearly "as popular" right now.

If the judges think differently -- about the state right now -- would be curious to know.

Of course many more months left, both for OpenAI and for the competition.

@Moscow25 I agree, and I think my statement from 2 months ago still applies:

If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.

Grok, Claude, and Gemini (maybe Llama?) are technically competitive, but none clearly surpass GPT4o on language capabilities. If it closed now, it would come down to popularity.

Apologies for the slow response!

The weighting of these features has a large impact.

If user count has any kind of serious weighting then ChatGPT will clearly remain on top.

Otherwise, it is kind of a bet on whether OpenAI releases something that beats Sonnet by EOY, but LMSYS suggests that 4o already does this.

If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided at least some members of the public have access).

Popularity is only relevant if it is disputed which is better technically, e.g. right now.

bought Ṁ500 NO from 38% to 37%

@Gen for what it's worth, I think if the top 2 models are from different companies (and likely even if they are from the same company, e.g., GPT-4 and GPT-4o), there will be disputes about which is better technically. Measuring LLM performance is a very contentious and difficult topic, especially with how easy it is for LLMs to be trained to the test (i.e., overfit).

Isn't Claude 3.5 good enough to resolve this?

I think it will be assessed at EOY, so it would still have to be the best then

But the question states before the end, so anytime it happens, even for a short time, it should resolve to yes.

Going off of this comment: https://manifold.markets/Gen/will-there-be-an-ai-language-model#rltduH4nnjsSdl0fGoVl I think it shouldn't

Will there be an AI language model that surpasses ChatGPT and other OpenAI models before the end of 2024?
54% chance.

Question is about any models by competitors vs any current or future openAI models.

To surpass chatGPT, it cannot just be more popular. If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided, at least some members of the public have access).

If there is dispute as to which is more powerful, a significant popularity/accessibility advantage will most likely decide the winner. There must be public access for it to be eligible.

The three main metrics for assessment:

1. Popularity and Accessibility

Assessed in real-terms. Total users and engagement (if public), otherwise I will defer to google trends or other publicly available data which can provide relative measures of popularity.

This metric is arguably most important, but is not the only factor as if an Android or iPhone AI gets pushed to existing OS, the existing assistant leaders in those areas (e.g. Siri, Google Assistant) will have a meaningful advantage, irrespective of product quality.

The popularity and userbase elements are essentially being used as a substitute for "usefulness".

2. Accuracy and Reliability

Assessed against a common set of questions or data, to be determined. I expect there will be studies or other academic materials published which I hope to refer to. If these papers conclude that "ExampleRacistBot" is more accurate than chatGPT because of censoring, etc. then "ExampleRacistBot" would be the more accurate model.

3. Power & Capability

Total computational ability, based on whichever metrics seem most relevant at the time. Right now those might be:

  • Number of Parameters

  • Tokens/words processed

  • Problem-solving abilities

  • Trainability (an example of an ongoing trainability exercise with GPT4: @/Mira/will-a-prompt-that-enables-gpt4-to

If there are any recommendations for Power & Capability metrics, please let me know. The biggest issue I see is that if I establish which metrics are most important (say; number of parameters, like everyone did for GPT3) we may end up in a situation like we are now, where GPT4 parameters are not disclosed. Expect this criteria to change over time.

Resolution Assessment

When the final assessment is being made, if there aren't reliable competitive grounds for comparison, the following series of checks are how I see the decisionmaking process going.

First check: Popularity. If one model is evidently significantly more popular (like >50% market share, or more popular by Google Trend metrics, etc.), then it will be considered the best unless there is evidence suggesting an alternative public AI is more accurate and powerful. If multiple models are within a similar frame of power and accuracy, popularity will determine the winner.

Second check: If there are multiple models of similar popularity, accuracy and power will determine the winner. If there are no good academic comparisons for accuracy, I will endeavour to conduct my own, but I'm really hoping someone else figures that out before I have to. Accuracy seems more tied to "usefulness" than power, but if there is a significant breakthrough in power such that one has a capability advantage over the other (like advanced logic problem solving) while maintaining accuracy, and for whatever reason it's not more popular but still public, that will win.

Third check: If there are multiple models of comparable popularity (or there is terrible data available) and there is no real clear difference in the capability, power, or accuracy - the decision will be deferred to more specific considerations, like a Google Trends comparison (relative search popularity) or the number of parameters (provided this is public, and still a respected metric)

If the decision becomes highly-subjective, I will sell all of my positions, donate the profits, and defer judgement to someone I deem to be an expert (and have reasonable access to) who will make the final call. I'll probably just email professors until I get a response, or ask someone enrolled in a related course at the time to ask their professor to respond.

[ Taking advice for updates or any proposed criteria changes ]

Competing for @AmmonLam's subsidy

[Changelog]

01/05/2023: Description updated to reflect thoughts expanded on in comments

Should change the title then. Shouldn't have to read a book to decide whether to bet on a market or not.

What do you think should be added to the title? I'm open to revisions that add clarity. "Surpass" in this context covers multiple criteria, and yeah, it's gross how long the description is (plus the comments etc.), so if anything contradicts the title/description I will update it.

I don't believe Claude 3.5 has surpassed the best OpenAI models enough to resolve this market

  • OpenAI is still #1 on the LMSYS leaderboard

  • OpenAI is still #1 in market share (in the last report I saw, they were estimated at ~65%)

I don't want this to be some Betamax situation where a model is technically better but unadopted (peaked at ~25% market share). My comment there was clarifying that if someone did release a model that beat them technically, it wouldn't be sufficient unless it was a huge step, or the competitor could leverage it into market share. E.g. OpenAI lost their #1 on LMSYS briefly earlier in the year, but it was extremely close, most people didn't even know about it, and ChatGPT was back on top almost immediately.

I have been using Claude 3.5 Sonnet though, and for the language model capabilities, I think I prefer it.

Write strongly surpass unequivocally

I added "unequivocally" to the title (“unequivocally passes”) but instantly reverted it. I think this is a good idea but it might not exactly fit the spirit. It might, but it’s late here and I need to sleep, so I’ll spend more time on it in the morning and make any changes then.

I like this suggestion though, I’ll get back to this in about 8-9 hours.

I changed it to "Strongly surpasses".

"Unequivocally" is close to what I want, but as I have outlined it doesn't exactly need to be unequivocal to resolve yes. If it's unequivocally better it will resolve early, but if at the end of the year there is a model that is meaningfully technically better, it will win. If the power/accuracy capabilities are close, it will resolve based on popularity.

If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.

opened a Ṁ1,000 NO at 55% order

FYI, don't pay attention to my trades. I'm trying to divest at a break-even point so that my incentives are aligned with producing the best resolution at the end of the year.

Zvi writes, "There is a new clear best (non-tiny) LLM" ... Read more: https://thezvi.substack.com/p/on-claude-35-sonnet

bought Ṁ10 YES at 52%
bought Ṁ10 NO at 54%

Somewhat similar question, but with more straightforward resolution criteria

What are people expecting in 2024 that would make this resolve yes? OpenAI is the current leader with a model they released in March 2023. They'll probably release something in 2024 which is clearly better than GPT-4. Google just released Gemini, which looks like it's approximately equal to GPT-4, meaning they're still a year behind OpenAI. Everyone else is still around GPT-3.5 level. Where is the 50% chance that OpenAI loses their lead coming from?

predicts YES

@dominic It seems to me more like a matter of "when" than "if". You did make me curious though, if not 2024 - I wonder where people stand on 2025?

There's a ton of AI orgs, and only OpenAI/Microsoft are putting resources into the GPT products. It only takes one big breakthrough or a different approach, e.g. Elon letting Grok be unhinged, which leads to increased accuracy or something, to beat OpenAI. Plus, they have the first-to-market disadvantage of all the legal heat, to whatever extent that proves to be a problem.

predicts NO

@Gen I’d be less surprised with 2025 than 2024, still think OpenAI has to be the favorite to be “in the lead” by then. I don’t think anyone aside from Google is “one big breakthrough” away from significantly surpassing GPT-4 though, and OpenAI/Microsoft have basically unlimited funding which most other AI labs do not, so I’d be really surprised to see a leading model come from a smaller player.

I would be very surprised if there isn't a model which is considered to 'surpass' GPT-4 in some way before the end of 2024.

I'm curious whether it counts if a bunch of different models come out which surpass GPT-4 in different ways, and then someone networks them together and makes a good UI for it. From what I'm aware, GPT-4 is basically that. It's partly parameter size, but a lot is having a bunch of tightly integrated smaller GPTs which specialise on different things.

Don't get me wrong, I'm really impressed with GPT-4, but I think it's replicable, and LLMs are still so YOUNG as a technology. There are many more ways for this technology to be improved without training a bigger model (and we don't know how much that would help yet).

So maybe Google or Microsoft, or whoever trains a bunch of GPTs really efficiently, then makes a great UI to interact with them, and everyone prefers that model because it's the easiest one to work with, and does less of the annoying stuff that LLMs do. Long term, my bet on someone doing that well would be Apple, but I don't expect them to try by the end of next year.

Maybe the most likely path is Microsoft uses all the OpenAI tech, and their close partnership to integrate with as much Windows stuff as they can, and that's the path to mass adoption. Everyone who uses Office at work will get it, they'll use it to compose emails and write documents and code and whatever. Microsoft can then use that monstrous amount of data to improve their integrations, people get more used to using Copilot, and people want to use ChatGPT because it's not natively integrated.

Was this worth typing? I'm new here. Do I just write down how I see things playing out?

predicts YES

@MattMeskell Definitely worth typing, and you bring up something important

I'm curious whether it counts if a bunch of different models come out which surpass GPT-4 in different ways, and then someone networks them together and makes a good UI for it.

I am willing to defer to people with more AI expertise, but afaik GPT4 is rumoured to be an MoE (mixture of experts), which basically sounds like 8 models Frankensteined together. If someone else launches a client that Frankensteins multiple models (that aren't OpenAI's) and somehow it's better than the best OpenAI product/model/whatever, that should count. However, it can't be something like Poe, which just plugs you into the different models. It would need to be one input field (within reason, as ChatGPT has multiple input fields where you can modify your prompts by telling it your name/job/etc.) that makes the selections behind the scenes and provides an output.

I should also say though, the comparison is for the "language model" part. I don't think any multimodal stuff should impact the decision here outside of the impact it will inevitably have on the popularity or market share.

© Manifold Markets, Inc.