Will there be an open source model that catches up to ChatGPT this year?
resolved Dec 11

Will there be an open source model that catches up to ChatGPT (similar performance within the ballpark of GPT-3.5) this year? As of the time of this posting, the top open source model is UAE's Falcon 40B.

predicted YES

yet another candidate to resolve this

bought Ṁ2,500 YES from 92% to 97%
predicted YES

Besides the example the first user here mentioned, there are other Yi models and Llama-based models that surpass GPT-3.5. That even goes beyond the necessary ‘similar ballpark’ performance.

@LinHaowen Should this resolve?

bought Ṁ250 of YES

Resolves YES based on 01.ai's models, which exceed GPT-3.5's benchmarks

predicted YES

@Nikola So we have llama2-70b and this fine-tuned version that I would say are in the ballpark of GPT-3.5

Also we have Llama2 and platypus getting 70% on the MMLU, same score as ChatGPT.

predicted YES

Now we have orca-13b. If it is actually open-sourced before EOY, I can't see the argument for NO at this point.

predicted NO

Evaluating LLMs is very difficult. This market will heavily depend on the metrics - I'd say the subjective usability and UX of open models is currently not even close to ChatGPT. That is even excluding all the other goodies that you get with ChatGPT (e.g. plugins, browsing, speed, longer context).

Here's some research(ers) that I believe supports my position:


Roughly, some of the conclusions are that the base LLM is very important (which in turn depends a lot on its size) for the breadth of knowledge and reasoning abilities. I believe ChatGPT (even 3.5) is still probably significantly larger (in its original trained form) and higher quality than any of the currently available open models. Due to narratives about safety, we are unlikely to see many open releases of very large models (especially if by open we mean Apache-2.0-like licensing).

predicted YES

@Tony122 I think they're saying base LMs need to improve in order for open-source to keep pushing forward, which I agree with. They will hit a wall of diminishing returns eventually.

Falcon-40B afaik uses its own base model, it isn't a copied model, which is part of the reason it does so well.

None of those tweets mentioned any of the best open source models. (Open-source just means you can see the weights/code)

Also performance was the thing mentioned, that has absolutely nothing to do with UX/usability. And I don't think that's even true, chat.lmsys.org is pretty usable.

If you believe Guanaco somehow sucks, try this game where you compare its outputs to ChatGPT. https://colab.research.google.com/drive/1kK6xasHiav9nhiRUJjPMZb4fAED4qRHb?usp=sharing

Guanaco is already beating ChatGPT in benchmarks, and Falcon is reportedly even better. And we still have another 6 months to go before market closes.

The answer seems clear enough to me.

bought Ṁ50 of NO

@ShadowyZephyr Paper authors are always going to show their work in the most positive light, understandably so.

There are many people who'd disagree that open-source = just accessible code/weights (https://en.wikipedia.org/wiki/Open_source), but let's hope the market doesn't come to arguing that point (not very interesting).

Performance is not a concrete metric - there are so many different metrics and benchmarks, but what we ultimately care about is how useful these chatbots are - all benchmarks hope to be a proxy for that. Picking a benchmark favourable to some open model proves little imo.

Here's some more people making predictions closer to the NO side:

Seems the answer is not clear to those people (and it is not for me).

@Tony122 I should have specified GPT-3.5 turbo, since that's the version that's publicly available for testing via API. Welps

bought Ṁ50 of YES

You can literally TRY these chatbots now. The 33B version is completely free at chat.lmsys.org, and for the other versions of falcon and guanaco you'd probably need to buy cloud compute.

As for the 2 tweets, the first only mentions GPT-4, and the second sets a much higher bar: matching GPT-3.5 in ALL major tasks (including 8k context, which is odd to mention, because base ChatGPT has a 4k context, not 8k) is a much higher bar than 'similar performance'.

The first tweet you mentioned is about GPT-4, not GPT-3.5. And the second one is just wrong. I've tried these models, and I'm telling you that even if they aren't as good as ChatGPT, they are close enough to count as within the same ballpark of performance.

And the Wikipedia page says the source code must be available for redistribution and modification, which it is. It is simply not available for commercial use, which is completely different.

The benchmarks for LLaMA models are 100% gamed, but I don't think it matters. Even considering that they are worse than the benchmarks say, they still match ChatGPT in most things, and that's enough to resolve this YES.

predicted YES

@Tony122 I suspect that GPT-3.5 Turbo is actually a pretty small model, given how cheap it is and how fast it can run.

predicted YES

@osmarks 175B Parameters = Small??

predicted YES

@ShadowyZephyr gpt-3.5-turbo is not necessarily the same size as the davinci models.

predicted YES

@osmarks Actually pretty sure it is literally just text-davinci but with rlhf/post-training for chat

predicted YES

@ShadowyZephyr I don't think this has ever been confirmed. Note that ChatGPT used to use a different model (render-text-davinci-002 or something like that) until around March.

predicted YES

@osmarks Some articles say it's an "improved version of davinci". Maybe OpenAI has never confirmed it; in that case I made a mistake, but I still think it's likely.

bought Ṁ85 of YES

Falcon-40B and Guanaco-65B already match/exceed ChatGPT in performance, so I don't know why you made this market.

bought Ṁ20 of YES

@ShadowyZephyr I agree.

predicted YES

@ShadowyZephyr Almost every openish LLM claims to beat ChatGPT. Even Alpaca did. I think this mostly just implies that their metrics are bad. I think that by the end of the year people will have gotten together good enough evaluations and training data to make competitive models, but right now we do not appear to be there.

predicted YES

@osmarks if many claimed that they are better it doesn't mean that nobody is better and especially doesn't mean that nobody can match

predicted YES

@qumeric My experience of testing many of these models (admittedly mostly 13-33B ones quantized to 4 bits to fit on my computer) is that they are significantly worse than evaluations say they "should" be in subjective goodness, which makes me distrust the evaluations. I have not actually tested Falcon-40B and Guanaco-65B but my expectation is that they will also not be that good.
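For a rough sense of why 13-33B models quantized to 4 bits fit on a single computer, here is a back-of-envelope sketch. It assumes memory is simply parameter count times bits per weight, ignoring activations, the KV cache, and quantization overhead, so real usage will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (activations, KV cache, scales) is counted.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Model sizes mentioned in the thread, at 4-bit and 16-bit precision.
for size in (13, 33, 65):
    four_bit = weight_memory_gb(size, 4)
    sixteen_bit = weight_memory_gb(size, 16)
    print(f"{size}B: {four_bit:.1f} GB at 4 bits, {sixteen_bit:.1f} GB at 16 bits")
```

Under these assumptions a 33B model needs about 16.5 GB for weights at 4 bits, versus about 66 GB at 16 bits.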

predicted YES

@osmarks Alpaca didn’t claim to beat ChatGPT in anything except being open source and fast. Alpaca sucks. Even vicuña didn’t claim to beat ChatGPT. Guanaco and Falcon are the only ones that did. In my opinion they (the ones I tried like Guanaco-33B) are the same or a little behind ChatGPT, not ahead, but that is already ‘similar’ performance. I assume Falcon will be as good as ChatGPT itc

If you could give me examples of what ChatGPT does well that Guanaco fails I’d like to hear that

predicted YES

@ShadowyZephyr I misremembered somewhat, sorry; Alpaca is compared with text-davinci-003 (they say "We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003."), and Vicuna only claims 90% of ChatGPT performance. It is possible that I am partly mixed up with one of the many, many LLaMA finetunes with different datasets. I have not tried Guanaco-65B but will see if I can find a demo somewhere.
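To put the quoted 90-versus-89 result in perspective, here is a minimal sketch that turns the two win counts into a win rate with a normal-approximation interval. It assumes the 179 comparisons are independent and that ties are excluded, neither of which the quote states. The function name is hypothetical.

```python
import math


def win_rate_with_ci(wins_a: int, wins_b: int, z: float = 1.96):
    """Return (win rate of A, lower bound, upper bound).

    Uses the normal approximation to the binomial; z = 1.96 gives
    an approximate 95% interval.
    """
    n = wins_a + wins_b
    p = wins_a / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se


# Counts from the quoted Alpaca 7B vs text-davinci-003 comparison.
rate, low, high = win_rate_with_ci(90, 89)
print(f"Alpaca win rate: {rate:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

With so few comparisons the interval comfortably includes 0.5, so the count alone cannot separate the two models in either direction. It also says nothing about whether the comparison prompts reflect real usage.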

predicted YES

@osmarks Oh, that Alpaca thing is totally wrong; it's not even close to davinci. But the Vicuna benchmark is more legit, and guanaco-33b does seem ALMOST at the level of ChatGPT.

predicted YES

@ShadowyZephyr I found the Guanaco-33B demo and it is quite impressive, except, for some reason, at writing code and roleplay-ish tasks.

predicted YES

@osmarks Exactly, I think if that is 33B, then Falcon-40B and Guanaco-65B should be "within similar ballpark of performance." That isn't a super high bar.