Will the 400B+ open source Llama 3 model rank higher than GPT-4-Turbo-2024-04-09 on the lmsys leaderboard?
Dec 31 · 51% chance

Resolves YES if 7 days after the first 400B+ open source Llama 3 model appears on the lmsys leaderboard, it is ranked higher than GPT-4-Turbo-2024-04-09.

Resolves NO if it ranks lower or if no such model is released in 2024.
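As a rough illustration, the resolution rule above can be sketched as a small function. The function name, its parameters, and the convention that a smaller rank number means a higher placement are assumptions made for this sketch. The description does not say what happens on an exact tie, so the sketch returns `None` in that case rather than guessing.

```python
from typing import Optional


def resolve_market(
    released_in_2024: bool,
    llama_rank: Optional[int],
    gpt4_turbo_rank: Optional[int],
) -> Optional[str]:
    """Hypothetical sketch of the stated resolution rule.

    Ranks are assumed to be leaderboard positions read 7 days after the
    first 400B+ open source Llama 3 model appears (1 = top of the board).
    """
    # No qualifying model released in 2024: the description says NO.
    if not released_in_2024:
        return "NO"
    # No 7-day reading of both ranks is available yet: nothing to resolve.
    if llama_rank is None or gpt4_turbo_rank is None:
        return None
    if llama_rank < gpt4_turbo_rank:
        return "YES"  # Llama 3 is ranked higher
    if llama_rank > gpt4_turbo_rank:
        return "NO"  # Llama 3 is ranked lower
    # Exact tie: not covered by the description (assumption: leave undecided).
    return None
```

For example, `resolve_market(True, 3, 5)` returns `"YES"`, while `resolve_market(False, None, None)` returns `"NO"`.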

Resolution of this market is delegated to @jskf

https://chat.lmsys.org/

Zvi asked @Austin to take over resolution of this market, and Austin delegated it to me. I discussed it with other mods, primarily @chrisjbillington and @jacksonpolack. We have decided that for the purposes of this market GPT-4 will refer to GPT-4-Turbo-2024-04-09. The title and description have been updated to make this clear and to remove subjective judgment. This decision is final. Future questions and requests for clarification concerning this market should be directed to me.

Explanation:

We think both "GPT-4 at time of market creation" and "GPT-4 at time of Llama 3 release" were reasonable interpretations of "GPT-4" when the market was created. It's unfortunate that this ambiguity existed and that it wasn't clarified when @alexlitz raised it in January. Traders and the market creator both bear responsibility for misinformed trades that occur as a result of this kind of ambiguity. When traders are surprised by a reasonable clarification (as happened yesterday) they can give feedback through the resolution rating system or in the comments. However, if a clarification is posted while the market is open for trading, we feel traders have a reasonable expectation that it will not be retracted when they trade in response to this. If a clarification is up for debate, this should be made clear and (ideally) the market closed until a decision is made.

The interpretation we chose most closely aligns with the initial clarification, and out of the options Zvi proposed after retracting his clarification we feel it's the most straightforward interpretation of the market title as of today.

We're not going to decide the criteria based on poll results. In general we think this is not a good way to resolve ambiguity once the criteria are already in dispute. On top of that, there are two different polls: the second (on Twitter) was announced 3 hours after the first (on Manifold), which was itself announced 16 hours after the initial clarification. We think neither poll should have happened, and neither poll currently overwhelmingly favors a particular interpretation.

Bleh. Sorry everyone. I feel pretty bad about this despite the small stakes. I realize that I handled this badly, but also it's distracting me far too much and I need that to stop. I am asking @Austin to take over resolution from here, and have emailed him to that effect - I would urge him to follow what I have laid out but to use his own judgment.

@ZviMowshowitz I've closed the market for trading right now. If other mods disagree we might re-open, but we can't roll back individual trades, so closing seems more conservative.

@ZviMowshowitz that makes sense. It was an interesting question to pose, and I appreciate your efforts! Austin isn't Manifold staff anymore, but being able to transfer market resolution to independent mods (so not me in this market given my position) is an important tool on Manifold, so I'm glad you're utilizing it.

OK, that community reaction looks clear, so I will reconsider [Please trade responsibly while we do that!]

Apologies to anyone who was misled, etc.

You can LIKE my responses to this comment to vote on which resolution we will choose. I will offer two choices.

@ZviMowshowitz Option 1: This resolves to GPT-4-Turbo-2024-04-09. It will NOT update further from here if OpenAI improves the model more, no matter what the new item is called.

@ZviMowshowitz Option 2: This resolves to GPT-4-0314. Unless people objected, I would just resolve this to YES at this point (if they objected I would do a Twitter poll to confirm).

@ZviMowshowitz this seems easy to manipulate or at least a Keynesian beauty contest. I appreciate you editing the market title to say the resolution is in question, but I would strongly suggest not changing the criteria after you explicitly set it.

@Jacy I am counting on the people here to be honest about this. Note that we can see exactly who is voting for what, so manipulation should be pretty obvious. And I would urge anyone who agrees with Jacy (e.g. Arky?) to vote for option 1.

As a backup, I will do a Twitter poll, which is much harder to manipulate, in case the answer there is overwhelmingly one way or the other.

@Jacy I agree. If you post “it needs an Elo of 1260” and leave it up for people to trade on for 17 hours, I think it’s pretty unfair to go back on that.

@ZviMowshowitz I'm not sure "honesty" does much here. You're just asking about which resolution criterion people prefer, not which is the most natural or reasonable. People can be perfectly honest and vote either way, right?

@ZviMowshowitz putting aside fairness and incentives, I'll just add to the mix of considerations that the market has a strong implicit vote for Option 1. There are 68 NO and 23 YES holders. Assuming inefficiencies are balanced (e.g., equal likelihood of each type of holder to be watching the comments), the implicit poll result would be around 3:1. Moreover, while Option 2 is basically a guaranteed YES, Option 1 is far from a guaranteed NO, so in a sense all of those NO votes implicitly favor Option 1, while some of the YES votes may still have had Option 2 in mind with their trading. So the ratio is larger, at least from this point of view.
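The "around 3:1" figure in the comment above can be checked with a few lines of arithmetic. This is only a sketch of that calculation: the holder counts come from the comment, while the variable names are made up for illustration.

```python
# Holder counts quoted in the comment above.
no_holders = 68
yes_holders = 23

# Ratio of NO holders to YES holders, and NO holders as a share of all holders.
ratio = no_holders / yes_holders
no_share = no_holders / (no_holders + yes_holders)

print(f"NO:YES holder ratio = {ratio:.2f}:1")    # NO:YES holder ratio = 2.96:1
print(f"NO share of holders = {no_share:.1%}")   # NO share of holders = 74.7%
```

This counts holders only, not position sizes, which matches the comment's framing of an implicit head-count poll.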

Additional clarification: This will resolve on the 3rd larger Llama 3 model (although if the second one does it, I would consider that sufficient as well). The big one is the one that counts and they have until EOY to release it.

To clarify: Llama-3 would need to be as good as the current GPT-4-Turbo. So it would need an Elo of 1260 right now. We shall see.

I'm guessing it was so high because several past chatbot arena markets about GPT-4 resolved based on being higher than any GPT-4 version.

@ZviMowshowitz Are you waiting until the end of the year in case they release a better Llama 3 variant? Or only the initial release?

@ZviMowshowitz This seems like a questionable resolution decision, honestly. It's not out of the question that OAI just keeps the GPT-4 label for another year or more even as the capabilities continue to improve. Not clear that traders interpreted the question as a moving target earlier on.

@NoraBelrose I agree. I doubt anyone interpreted this question as a moving target. Llama 3 should be compared to the GPT-4 version that had been released by the time this question was created, not to any other version, and especially not GPT-4-Turbo.

@Soli @NoraBelrose this is a common tension on Manifold, but for what it's worth, I think the community is good at flagging these ambiguities in the comments sections, at least for popular topics. @alexlitz asked this question 3 months ago. I think, ideally, a mod would have pinned that comment to make it clearer to readers that a bet on this market may in large part be a bet on which way Zvi would resolve that ambiguity.

Oh, to be clear, I liked Nora's post because I thought it was a good thing to raise, not as an endorsement of the POV lol. I think both current GPT and GPT at time of creation are reasonable interpretations, and agree with Jacy that the best way to prevent confusion is to clarify early.

@jacksonpolack I agree that both resolutions are reasonable, so I am checking to see if there was a clear consensus view, and will act accordingly.

Which variant of GPT-4 are you referring to?

Since there are multiple versions of GPT-4 on the LMSys leaderboard (and there could be more) which will it be compared to?
