I'm looking for ideas for how to operationalize this question.
Hopefully the answer will be pretty obvious, but if it's not, my current plan is to set up a poll here on Manifold, or on Twitter. The main problem case is a model that does somewhat better than GPT-4 on most metrics but whose qualitative behavior is not noticeably better; in that case I'll probably resolve NO.
GPT-4.5 would not count, but a non-GPT-4 LLM that is less powerful than a 2023-produced GPT-4.5 but more powerful than the current GPT-4 would.
What does "come out" mean? Does being talked about in a paper count? Does it count if a few Google engineers have access? Does it need to be publicly accessible to everyone? What if it's open to the public but there's a limited alpha waitlist?
So models by OpenAI count? Do we wanna measure timelines for models more powerful than GPT-4 by someone who didn't make GPT-4 (and is thus not already baselined above everyone else)?
Why do you say a GPT-4.5 wouldn’t count? GPT-3.5 integrated into ChatGPT was what started all this hype, because it was so much better than GPT-3.
@DylanSlagh Because I’d rather this market not resolve based on naming conventions, and because this was the question I wanted an answer to when creating the market. But I think it makes sense to create an alt market with an alt resolution criterion.
From the new model’s paper, compare how many tasks it’s better at than GPT-4.
This will def be skewed towards tasks it’s better at. But it’s a start (:
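A rough sketch of that count, in case it helps. Everything here is an assumption for illustration: the `count_wins` helper is hypothetical, the task names and scores are placeholders rather than real benchmark results, and it assumes every metric is "higher is better". Only tasks reported for both models are compared.

```python
def count_wins(new_scores, baseline_scores):
    """Count shared tasks where the new model's score beats the baseline's.

    Assumes higher scores are better on every task (an assumption; flip the
    sign beforehand for metrics like loss or error rate).
    """
    # Only compare tasks that both models report.
    shared = sorted(set(new_scores) & set(baseline_scores))
    wins = [task for task in shared if new_scores[task] > baseline_scores[task]]
    return len(wins), len(shared)


# Placeholder numbers, NOT real benchmark results.
new_model = {"task_a": 0.82, "task_b": 0.64, "task_c": 0.71}
gpt4 = {"task_a": 0.80, "task_b": 0.70, "task_c": 0.69, "task_d": 0.90}

wins, total = count_wins(new_model, gpt4)
print(f"new model better on {wins} of {total} shared tasks")
```

Since the paper picks which tasks to report, a high win count is weak evidence on its own; it would be one input alongside the poll, not a replacement for it.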