The question covers any model by a competitor versus any current or future OpenAI model.
To surpass ChatGPT, a model cannot just be more popular. If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided at least some members of the public have access).
If there is dispute as to which is more powerful, a significant popularity/accessibility advantage will decide the winner. There must be public access for it to be eligible.
Explaining the three main metrics for assessment:
1. Popularity and Accessibility
Assessed in real terms: total users and engagement (if public). Otherwise I will defer to Google Trends or other publicly available data that can provide relative measures of popularity.
This metric is arguably the most important, but it is not the only factor: if an Android or iPhone AI gets pushed to the existing OS, the existing assistant leaders in those areas (e.g. Siri, Google Assistant) will have a meaningful advantage, irrespective of product quality.
The popularity and userbase elements are essentially being used as a substitute for "usefulness".
2. Accuracy and Reliability
Assessed against a common set of questions or data, to be determined. I expect there will be studies or other academic materials published which I hope to refer to. If these papers conclude that "ExampleRacistBot" is more accurate than ChatGPT (because of censoring, etc.), then "ExampleRacistBot" would be the more accurate model.
3. Power & Capability
Total computational ability, based on whichever metrics seem most relevant at the time. Right now those might be:
Number of Parameters
Tokens/words processed
Problem-solving abilities
Trainability (an example of an ongoing trainability exercise with GPT4: /Mira/will-a-prompt-that-enables-gpt4-to
If there are any recommendations for Power & Capability metrics, please let me know. The biggest issue I see is that if I establish which metrics are most important (say, number of parameters, like everyone did for GPT-3) we may end up in a situation like the one we are in now, where GPT-4's parameters are not disclosed. Expect these criteria to change over time.
Resolution Assessment
When the final assessment is being made, if there aren't reliable competitive grounds for comparison, the following series of checks is how I see the decision-making process going.
First check: Popularity. If one model is evidently significantly more popular (like >50% market share, or more popular by Google Trends metrics, etc.), then it will be considered the best unless there is evidence suggesting an alternative public AI is more accurate and powerful. If multiple models are within a similar frame of power and accuracy, popularity will determine the winner.
Second check: If there are multiple models of similar popularity, accuracy and power will determine the winner. If there are no good academic comparisons for accuracy, I will endeavour to conduct my own, but I'm really hoping someone else figures that out before I have to. Accuracy seems more tied to "usefulness" than power, but if there is a significant breakthrough in power such that one has a capability advantage over the other (like advanced logic problem solving) while maintaining accuracy, and for whatever reason it's not more popular but still public, that will win.
Third check: If there are multiple models of comparable popularity (or the available data is terrible) and there is no clear difference in capability, power, or accuracy, the decision will be deferred to more specific considerations, like a Google Trends comparison (relative search popularity) or the number of parameters (provided this is public and still a respected metric).
If the decision becomes highly subjective, I will defer judgement to someone I deem to be an expert (and have reasonable access to) who will make the final call. I'll probably just email professors until I get a response, or ask someone enrolled in a related course at the time to ask their professor to respond.
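As a rough illustration only, the checks above could be sketched in Python. This is a simplification under stated assumptions, not the actual resolution procedure: the `Candidate` fields, the single combined `technical_score`, and the `CLEAR_TECHNICAL_GAP` threshold are all hypothetical. Only the ">50% market share" figure comes from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    is_public: bool         # at least some members of the public have access
    market_share: float     # fraction in [0, 1]; stand-in for popularity data
    technical_score: float  # hypothetical combined accuracy/power score in [0, 1]


DOMINANT_SHARE = 0.50       # ">50% market share" from the first check
CLEAR_TECHNICAL_GAP = 0.05  # assumption: what counts as a clear technical lead


def pick_leader(candidates: List[Candidate]) -> Optional[str]:
    """Return the leading model's name, or None if the call must be deferred."""
    eligible = [c for c in candidates if c.is_public]
    if not eligible:
        return None
    if len(eligible) == 1:
        return eligible[0].name

    by_share = sorted(eligible, key=lambda c: c.market_share, reverse=True)
    by_tech = sorted(eligible, key=lambda c: c.technical_score, reverse=True)
    gap = by_tech[0].technical_score - by_tech[1].technical_score
    clear_tech_lead = gap > CLEAR_TECHNICAL_GAP

    # First check: a dominant model wins unless another public model
    # is clearly more accurate and powerful.
    if by_share[0].market_share > DOMINANT_SHARE:
        if clear_tech_lead and by_tech[0] is not by_share[0]:
            return by_tech[0].name
        return by_share[0].name

    # Second check: with no dominant model, accuracy and power decide.
    if clear_tech_lead:
        return by_tech[0].name

    # Third check: nothing is clear, so defer to tie-breakers or an expert.
    return None
```

With these made-up numbers, a model at 65% share and a score of 0.80 beats a rival at 20% share and 0.82 (the technical gap is too small), but loses to a rival at 20% share and 0.95.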
[ Taking advice for updates or any proposed criteria changes ]
Competing for @AmmonLam's subsidy
[Changelog]
01/05/2023: Description updated to reflect thoughts expanded on in comments
I would love to know how this is looking in the eyes of the judges.
To my reading... many models are "as good" as ChatGPT but none are clearly better on technical comparisons... and none are nearly "as popular" right now.
If the judges think differently -- about the state right now -- would be curious to know.
Of course many more months left, both for OpenAI and for the competition.
@Moscow25 I agree, and I think my statement from 2 months ago still applies:
If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.
Grok, Claude, and Gemini (maybe Llama?) are technically competitive, but none clearly surpass GPT4o on language capabilities. If it closed now, it would come down to popularity.
Apologies for the slow response!
@Gen for what it's worth, I think if the top 2 models are from different companies (and likely even if they are from the same company, e.g., GPT-4 and GPT-4o), there will be disputes about which is better technically. Measuring LLM performance is a very contentious and difficult topic, especially with how easy it is for LLMs to be trained to the test (i.e., overfit).
Going off of this comment: https://manifold.markets/Gen/will-there-be-an-ai-language-model#rltduH4nnjsSdl0fGoVl I think it shouldn't
What do you think should be added to the title? I'm open to revisions that add clarity. "Surpass" in this context covers multiple criteria, and yeah, it's gross how long the description is + the comments etc., so if anything contradicts the title/description I will update it.
I don't believe Claude 3.5 has surpassed the best OpenAI models enough to resolve this market.
OpenAI is still #1 on the LMSYS leaderboard.
OpenAI is still #1 for market share (last report I saw they were estimated at ~65%).
I don't want this to be some Betamax situation where it is technically better but unadopted (peaked at ~25% market share), and my comment there was clarifying that if someone did release a model that beat them technically, it wouldn't be sufficient unless it was a huge step, or the competitor could leverage it into market share. For example, OpenAI lost their #1 spot on LMSYS briefly earlier in the year, but it was extremely close, most people didn't even know about it, and ChatGPT was back on top almost immediately.
I have been using Claude 3.5 Sonnet though, and for the language model capabilities, I think I prefer it.
I added "unequivocally" to the title ("unequivocally passes") but instantly reverted it. I think this is a good idea but it might not exactly fit the spirit. It might, but it's late here and I need to sleep, so I'll spend more time on it in the morning and make any changes then.
I like this suggestion tho, I'll get back to this in about 8-9 hours.
I changed it to, "Strongly surpasses".
"Unequivocally" is close to what I want, but as I have outlined it doesn't exactly need to be unequivocal to resolve yes. If it's unequivocally better it will resolve early, but if at the end of the year there is a model that is meaningfully technically better, it will win. If the power/accuracy capabilities are close, it will resolve based on popularity.
If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.
Zvi writes, "There is a new clear best (non-tiny) LLM" ... Read more: https://thezvi.substack.com/p/on-claude-35-sonnet
What are people expecting in 2024 that would make this resolve yes? OpenAI is the current leader with a model they released in March 2023. They'll probably release something in 2024 which is clearly better than GPT-4. Google just released Gemini, which looks like it's approximately equal to GPT-4, meaning they're still a year behind OpenAI. Everyone else is still around GPT-3.5 level. Where is the 50% chance that OpenAI loses their lead coming from?
@dominic It seems to me more like a matter of "when" than "if". You did make me curious though, if not 2024 - I wonder where people stand on 2025?
There's a ton of AI orgs and only OpenAI/Microsoft putting resources into the GPT products. It only takes one big breakthrough or a different approach (e.g. Elon letting Grok be unhinged, which leads to increased accuracy or something) to beat OpenAI. Plus, they have the first-to-market disadvantage of all the legal heat, to whatever extent that proves to be a problem.
@Gen I’d be less surprised with 2025 than 2024, still think OpenAI has to be the favorite to be “in the lead” by then. I don’t think anyone aside from Google is “one big breakthrough” away from significantly surpassing GPT-4 though, and OpenAI/Microsoft have basically unlimited funding which most other AI labs do not, so I’d be really surprised to see a leading model come from a smaller player.
I would be very surprised if there isn't a model which is considered to 'surpass' GPT-4 in some way before the end of 2024.
I'm curious whether it counts if a bunch of different models come out which surpass GPT-4 in different ways, and then someone networks them together and makes a good UI for it. From what I'm aware, GPT-4 is basically that. It's partly parameter size, but a lot is having a bunch of tightly integrated smaller GPTs which specialise on different things.
Don't get me wrong, I'm really impressed with GPT-4, but I think it's replicable, and LLMs are still so YOUNG as a technology. There are many more ways for this technology to be improved without training a bigger model (and we don't know how much that would help yet).
So maybe Google or Microsoft, or whoever, trains a bunch of GPTs really efficiently, then makes a great UI to interact with them, and everyone prefers that model because it's the easiest one to work with and does less of the annoying stuff that LLMs do. Long term, my bet on someone doing that well would be Apple, but I don't expect them to try by the end of next year.
Maybe the most likely path is Microsoft uses all the OpenAI tech, and their close partnership, to integrate with as much Windows stuff as they can, and that's the path to mass adoption. Everyone who uses Office at work will get it, and they'll use it to compose emails and write documents and code and whatever. Microsoft can then use that monstrous amount of data to improve their integrations, people get more used to using Copilot, and people stop wanting to use ChatGPT because it's not natively integrated.
Was this worth typing? I'm new here. Do I just write down how I see things playing out?
@MattMeskell Definitely worth typing, and you bring up something important
I'm curious whether it counts if a bunch of different models come out which surpass GPT-4 in different ways, and then someone networks them together and makes a good UI for it.
I am willing to defer to people with more AI expertise, but afaik GPT-4 is rumoured to be an MoE, which basically sounds like 8 models Frankensteined together. If someone else launches a client that Frankensteins multiple models (that aren't OpenAI's) and somehow it's better than the best OpenAI product/model/whatever, that should count. However, it can't be something like Poe, which just plugs you into the different models. It would need to be one input field (within reason, as ChatGPT has multiple input fields where you can modify your prompts by telling it your name/job/etc.) that makes the selections behind the scenes and provides an output.
I should also say though, the comparison is for the "language model" part. I don't think any multimodal stuff should impact the decision here outside of the impact it will inevitably have on the popularity or market share.
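To make the "one input field" distinction concrete, here is a toy Python sketch. Everything in it is hypothetical: the two stand-in models, the keyword routing rule, and the function names are invented for illustration, and a real client would presumably route with something far more capable than keyword matching.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def code_model(prompt: str) -> str:
    # Stand-in for a model that specialises in code (hypothetical).
    return f"[code-model] {prompt}"


def general_model(prompt: str) -> str:
    # Stand-in for a general-purpose model (hypothetical).
    return f"[general-model] {prompt}"


BACKENDS: Dict[str, Model] = {"code": code_model, "general": general_model}


def choose_backend(prompt: str) -> str:
    """Toy routing rule based on keywords; purely illustrative."""
    keywords = ("def ", "function", "compile", "bug")
    lowered = prompt.lower()
    return "code" if any(k in lowered for k in keywords) else "general"


def single_input_client(prompt: str) -> str:
    """One input field: the selection happens behind the scenes."""
    return BACKENDS[choose_backend(prompt)](prompt)


def picker_client(prompt: str, user_choice: str) -> str:
    """Poe-style: the user picks which model to talk to."""
    return BACKENDS[user_choice](prompt)
```

Under the criterion described above, only something shaped like `single_input_client` would count as a single competing product, since `picker_client` leaves the model selection to the user.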