Resolves YES if, 7 days after the first 400B+ open source Llama 3 model appears on the lmsys leaderboard, it is ranked higher than GPT-4-Turbo-2024-04-09.
Resolves NO if it ranks lower or if no such model is released in 2024.
Resolution of this market is delegated to @jskf
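As a reading aid, the rule in the description can be sketched as a small function. This is only an illustration: the function name `resolve` and its parameters are hypothetical, lower rank numbers are assumed to be better, and the description does not say what happens on an exact tie, so the sketch returns `None` in that case rather than guessing.

```python
from typing import Optional


def resolve(llama_rank: Optional[int], gpt4_turbo_rank: int) -> Optional[str]:
    """Return "YES", "NO", or None when the stated rule does not decide.

    llama_rank: leaderboard rank of the first 400B+ open source Llama 3
        model, read 7 days after it first appears; None if no such model
        was released in 2024.
    gpt4_turbo_rank: rank of GPT-4-Turbo-2024-04-09 on the same day.
    Assumption: rank 1 is the top of the leaderboard.
    """
    if llama_rank is None:
        return "NO"  # no qualifying model released in 2024
    if llama_rank < gpt4_turbo_rank:
        return "YES"  # Llama 3 is ranked higher
    if llama_rank > gpt4_turbo_rank:
        return "NO"  # Llama 3 is ranked lower
    return None  # tie: not covered by the description
```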
Zvi asked @Austin to take over resolution of this market, and Austin delegated it to me. I discussed it with other mods, primarily @chrisjbillington and @jacksonpolack. We have decided that for the purposes of this market GPT-4 will refer to GPT-4-Turbo-2024-04-09. The title and description have been updated to make this clear and to remove subjective judgment. This decision is final. Future questions and requests for clarification concerning this market should be directed to me.
Explanation:
We think both "GPT-4 at time of market creation" and "GPT-4 at time of Llama 3 release" were reasonable interpretations of "GPT-4" when the market was created. It's unfortunate that this ambiguity existed and that it wasn't clarified when @alexlitz raised it in January. Traders and the market creator both bear responsibility for misinformed trades that occur as a result of this kind of ambiguity. When traders are surprised by a reasonable clarification (as happened yesterday) they can give feedback through the resolution rating system or in the comments. However, if a clarification is posted while the market is open for trading, we feel traders have a reasonable expectation that it will not be retracted when they trade in response to this. If a clarification is up for debate, this should be made clear and (ideally) the market closed until a decision is made.
The interpretation we chose most closely aligns with the initial clarification, and out of the options Zvi proposed after retracting his clarification we feel it's the most straightforward interpretation of the market title as of today.
We're not going to decide the criteria based on poll results. In general we think this is not a good way to resolve ambiguity once the criteria are already in dispute. On top of that, there are two different polls: the second (on Twitter) was announced 3 hours after the first (on Manifold), which was itself announced 16 hours after the initial clarification. We think neither poll should have happened, and neither poll currently overwhelmingly favors a particular interpretation.
Resolves YES. [edit: one week from now, assuming nothing crazy] Llama-3.1-405b is a full two ranks (3 positions by elo) higher than GPT-4-Turbo-2024-04-09.
To confirm, this is about leaderboard ranking (rather than raw ELO)?
(To account for margins of error, LMSYS will rank models with overlapping ELO estimates at the same rank.)
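To illustrate how overlapping estimates can end up sharing a rank, here is a minimal sketch. It assumes a model's rank is one plus the number of models whose whole confidence interval lies above its own; the exact LMSYS rule is not stated in this thread, so treat that rule as an assumption. The function name `rank_by_interval`, the model names, and the intervals are all hypothetical, not real leaderboard data.

```python
from typing import Dict, Tuple


def rank_by_interval(intervals: Dict[str, Tuple[float, float]]) -> Dict[str, int]:
    """Rank each model as 1 + the number of clearly better models.

    Assumption: model A is clearly better than model B only when A's
    lower bound is above B's upper bound, so overlapping intervals
    never separate two models.
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        clearly_better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


# Hypothetical (lower, upper) rating intervals, not real leaderboard data.
example = {
    "model_a": (1280.0, 1290.0),
    "model_b": (1262.0, 1272.0),
    "model_c": (1258.0, 1268.0),
    "model_d": (1240.0, 1250.0),
}
print(rank_by_interval(example))
# {'model_a': 1, 'model_b': 2, 'model_c': 2, 'model_d': 4}
```

Here `model_b` and `model_c` overlap, so both get rank 2 even though their point estimates would differ. This is why a model can be several positions ahead by raw rating but fewer ranks ahead.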
Bleh. Sorry everyone. I feel pretty bad about this despite the small stakes. I realize that I handled this badly, but also it's distracting me far too much and I need that to stop. I am asking @Austin to take over resolution from here, and have emailed him to that effect - I would urge him to follow what I have laid out but to use his own judgment.
@ZviMowshowitz I've closed the market for trading right now. If other mods disagree we might re-open, but we can't roll back individual trades, so closing seems more conservative.
@ZviMowshowitz that makes sense. It was an interesting question to pose, and I appreciate your efforts! Austin isn't Manifold staff anymore, but being able to transfer market resolution to independent mods (so not me in this market given my position) is an important tool on Manifold, so I'm glad you're utilizing it.
@ZviMowshowitz Option 1: This resolves to GPT-4-Turbo-2024-04-09. It will NOT update further from here if OpenAI improves the model more, no matter what the new item is called.
@ZviMowshowitz Option 2: This resolves to GPT-4-0314. Unless people objected, I would just resolve this to YES at this point (if they objected I would do a Twitter poll to confirm).
@ZviMowshowitz this seems easy to manipulate or at least a Keynesian beauty contest. I appreciate you editing the market title to say the resolution is in question, but I would strongly suggest not changing the criteria after you explicitly set it.
@Jacy I am counting on the people here to be honest about this. Note that we can see exactly who is voting for what, so manipulation should be pretty obvious. And I would urge anyone who agrees with Jacy (e.g. Arky?) to vote for option 1.
As a backup, I will do a Twitter poll, which is much harder to manipulate, in case the answer there is overwhelmingly one way or the other.
@Jacy I agree, if you post “it needs an ELO of 1260” and leave it up for people to trade on for 17 hours I think it’s pretty unfair to go back on that.
@ZviMowshowitz I'm not sure "honesty" does much here. You're just asking about which resolution criterion people prefer, not which is the most natural or reasonable. People can be perfectly honest and vote either way, right?
@ZviMowshowitz putting aside fairness and incentives, I'll just add to the mix of considerations that the market has a strong implicit vote for Option 1. There are 68 NO and 23 YES holders. Assuming inefficiencies are balanced (e.g., equal likelihood of each type of holder to be watching the comments), the implicit poll result would be around 3:1. Moreover, while Option 2 is basically a guaranteed YES, Option 1 is far from a guaranteed NO, so in a sense all of those NO votes implicitly favor Option 1, while some of the YES votes may still have had Option 2 in mind with their trading. So the ratio is larger, at least from this point of view.
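The "around 3:1" figure follows directly from the holder counts quoted above. A quick check of the arithmetic, counting each holder once regardless of position size (a simplification, since the comment gives no position sizes):

```python
# Holder counts as quoted in the comment above.
no_holders = 68
yes_holders = 23

ratio = no_holders / yes_holders
no_share = no_holders / (no_holders + yes_holders)

print(f"NO:YES ratio = {ratio:.2f}:1")         # NO:YES ratio = 2.96:1
print(f"NO share of holders = {no_share:.1%}")  # NO share of holders = 74.7%
```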