Will xAI's AI actually be "In some important respects, the best that currently exists"?
resolved Dec 6
Resolved
NO

Will resolve based on xAI's AI beating all other current models on a specific benchmark or Hugging Face leaderboard.

If there isn't a clear consensus then I'll leave the result up to a Manifold poll.

"In some important respects" is obviously pretty vague but hopefully we can all have some good discussions about it.

I'll extend the close date if more time is needed for testing and will just resolve N/A if it's looking like it won't actually be released.

For anyone confused: I resolved based on a poll, as the description originally said I would if there was no general consensus and no clear benchmark victory for Grok. I also mentioned this multiple times in the comments.

Poll was up for 36 hours (originally I said 24) and I posted it in the comments.

My bad if I should have left the poll up for longer, but I'm unfamiliar with the general etiquette and I assumed people would want the question resolved.

By “some important respects” he probably means its ability to answer censored content. Which is funny because it’s not even true; the open source models do a good job of that. But Elon will brag about that.

predicted YES

@ShadowyZephyr I think he means with regard to updating on real-time content (X/Twitter datasets available).

@Alfie will only standard benchmarks be considered, or also stuff like access to Twitter data or the model being supposedly less censored?

@Weezing This is how I would have interpreted Elon's claim. He was talking to consumers, not to foundation model researchers.

@Weezing I'll stay true to the original description I wrote and keep it to standard benchmarks. I do agree though that benchmarks and Hugging Face leaderboards don't cover everything that could be considered "some important respects". I originally chose that as it seemed the simplest/best way to get an objective answer.

I think it's fair to say there is no clear consensus so, like I said in the description, after the market closes I'll put a vote up for 24 hours and resolve based on that.

predicted YES

I think instant search access to Twitter (X) dataset is an important and unique advantage.

bought Ṁ50 of YES

Out of the larger, commercial LLMs it certainly stands out as a candidate for being differently aligned - now whether you consider that being 'better in an important respect' obviously depends on your political or even philosophical leanings in a way.

I'm pretty critical of the way bias is defined in benchmarking sets and what 'removing bias' means in practice, among many other big problems with how HHH is defined, so imo yes, training a large model that does not adhere to these standards (which have very little to do with AI safety in any case) could turn out to be very beneficial.

bought Ṁ10 of YES

What about usability aspects? Grok seems to be somewhat innovative in that regard, with access to live content and multiple chats.

The best claim to being "in some respects" the best would be being the best at certain benchmarks within its compute class. Is that sufficient to resolve this YES?

@DanMan314 That's an interesting point and probably the most charitable interpretation of what Elon is saying.

Do we know the compute of Grok and how we would define different classes of compute?

I am hesitant to say that's sufficient to resolve YES though, as "best that currently exists" really just makes me think he's talking about all models out there.

@Alfie Idk what it means either but it’s taken verbatim from the x.ai website:

I don't have an opinion on what this would mean for the market either.

@DanMan314 There are open source LLMs better than GPT-3.5 atp, and I don’t trust this statement by Elon.

bought Ṁ38 of YES

Some features they're hoping to have:

  • suggest what questions to ask

  • real-time knowledge of the world via the 𝕏 platform

  • answer spicy questions that are rejected by most other AI systems

  • useful to people of all backgrounds and political views

If they can be the best at 2 or more of these then I think Musk's statement is true. Some of these would be quite easy to quote unquote benchmark.

How formal of a benchmark do we need?

bought Ṁ10 YES from 16% to 29%
bought Ṁ40 NO
predicted YES
predicted NO

@Daniel_MC literally none of these are unique though.

bought Ṁ10 YES from 30% to 32%
predicted YES

@PeterBuyukliev it doesn't need to be unique, just better.

@Daniel_MC I don't know much about different formality levels of benchmarks but I don't think it needs to be super formal?

Just testing something "important" vs other AI models.

predicted YES

@Alfie so if I were to quickly do some tests and qualitatively compare responses would that be ok? (Assuming that it's clear enough that it's doing a better job than the main competitors).

For example, asking about current events to assess the second feature (real-time knowledge).

@Daniel_MC Hmmm I don't know if a test like that counts as a benchmark? I had something more established in mind that has been used to compare models in the past. Sorry, just want to make sure I stay true to the original description I wrote.

If I put it up to a poll (which I probably will given I'm feeling a bit out of my depth here) then a test like the one you mentioned would be important in helping people decide.

predicted YES

@Alfie fair enough. I think that some of the things I mentioned, like real time knowledge, would be pretty plain to see but wouldn't be captured in traditional benchmarks.

Maybe my exercise would be more about documenting / having basic evidence of the features being "the best that currently exists".

Might be helpful in coming to a consensus or in convincing people in a poll.

I think the main thing I want clarity on is what if it is "the best that currently exists" in a category that isn't traditionally assessed with benchmarks (because I don't think most of the things in my list are measured in benchmarks).

predicted YES

@Alfie do you have any thoughts on this?

predicted NO

He released a few benchmarks, and it seems solidly behind gpt-4.

bought Ṁ50 of NO

@PeterBuyukliev

This should be enough to resolve, right? Unless the other benchmark is how much like an edgy teenager you sound.

bought Ṁ10 YES from 32% to 35%

@Issc I won't resolve just based on these benchmarks as I don't think they test everything that could be considered "some important respects".

bought Ṁ10 YES from 34% to 37%
predicted NO

@Alfie Not trying to argue, just wondering what other benchmarks you would consider important?
