This market resolves NO if Meta releases a language model more powerful than Llama 2 (specifically, llama2-70b-chat) for public download during 2024, similar to how Llama 2 is currently available for public download at https://ai.meta.com/llama/. It resolves YES otherwise. A release before Jan 1, 2024 does not trigger a NO resolution.
See also https://www.governance.ai/research-paper/open-sourcing-highly-capable-foundation-models and metaprotest.org.
🏅 Top traders
# | Name | Total profit
---|---|---
1 | | Ṁ152
2 | | Ṁ70
3 | | Ṁ33
4 | | Ṁ28
5 | | Ṁ21
From the title I first thought that something like "Meta is legally forced to stop, or voluntarily commits publicly to stop, publishing at some point in 2024, even though it had already released at least one such >llama2-70b-chat model earlier in 2024" would resolve YES. But from the details it seems the question could be phrased more unambiguously as "Will Meta share the weights for at least one >llama2-70b-chat LLM in 2024?"
@HanchiSun I would suggest you specify llama2-70b-chat in your description instead of "more powerful than Llama 2"
@HanchiSun Also, you might want to specify what "more powerful" means. If a Code Llama 3 13B were better than llama2-70b-chat at coding, would that count? Or, if llama3-13b-chat is comparable to llama2-70b-chat, what standard will you use to judge? AlpacaEval or some other benchmark?
@HanchiSun
> Suppose they build Llama 3 in various sizes but only share the 7B and 13B versions, which are better than Llama 2 7B/13B but not 70B. Would that be considered more powerful than Llama 2?
Only if those models are more powerful than llama2-70b-chat.
> I would suggest you specify llama2-70b-chat in your description instead of "more powerful than Llama 2"
Done.
> Also, you might want to specify what "more powerful" means. If a Code Llama 3 13B were better than llama2-70b-chat at coding, would that count? Or, if llama3-13b-chat is comparable to llama2-70b-chat, what standard will you use to judge? AlpacaEval or some other benchmark?
I mean 'more powerful' across a range of tasks, not just a single type of task. I'll use a reasonable-seeming benchmark or combination thereof.
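To make that criterion a bit more concrete, here is a minimal sketch of one way such a comparison could be scored, assuming an unweighted average over a handful of public benchmarks. The benchmark names, placeholder scores, and the comparison rule are illustrative assumptions on my part, not the market's official resolution method.

```python
# Hypothetical sketch: operationalize "more powerful across a range of tasks"
# as a higher unweighted average over a small suite of public benchmarks.
# Benchmark choice and all scores below are placeholders, not real results.

BENCHMARKS = ["MMLU", "HellaSwag", "ARC", "HumanEval"]

# Placeholder scores (0-100); real values would come from published evals.
llama2_70b_chat = {"MMLU": 0.0, "HellaSwag": 0.0, "ARC": 0.0, "HumanEval": 0.0}
candidate_model = {"MMLU": 0.0, "HellaSwag": 0.0, "ARC": 0.0, "HumanEval": 0.0}


def mean_score(scores: dict) -> float:
    """Unweighted average across the chosen benchmark suite."""
    return sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS)


def more_powerful(candidate: dict, baseline: dict) -> bool:
    """True if the candidate beats the baseline on the averaged suite."""
    return mean_score(candidate) > mean_score(baseline)


if __name__ == "__main__":
    print(more_powerful(candidate_model, llama2_70b_chat))
```

A weighted average, or a requirement to win on a majority of benchmarks, would be equally defensible ways to combine scores; the point is only that the judgment spans multiple task types rather than a single one.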