OpenAI has announced a model named o3. What will be the score of this model on Humanity's Last Exam (https://agi.safe.ai/)?
Resolution is based on the score given for o3 on https://agi.safe.ai/. If there are multiple scores (e.g. for "high" and "medium" reasoning), resolution is based on the highest score. If there is no score on https://agi.safe.ai/ within a month from the release of the model, I will use my best judgment.
I will trade on this market.
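For anyone who wants the resolution rule spelled out, here is a rough sketch of how it could be read. This is only an illustration, not an official resolver: the function name `resolve`, the status labels, and the 30-day reading of "a month" are all assumptions, and the scores in the usage note are made-up placeholders, not real HLE results.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "within a month" is approximated as 30 days.
GRACE_PERIOD = timedelta(days=30)


def resolve(
    scores: dict[str, float], release: date, today: date
) -> tuple[str, Optional[float]]:
    """Hypothetical reading of the resolution rule.

    scores maps a listed variant (e.g. "high", "medium") to the score
    shown for o3 on the leaderboard; it is empty if nothing is listed.
    """
    if scores:
        # Multiple listed scores: the highest one counts.
        return ("resolved", max(scores.values()))
    if today - release > GRACE_PERIOD:
        # No listed score after the grace period: creator's judgment.
        return ("judgment", None)
    # Still inside the grace period with nothing listed yet.
    return ("pending", None)
```

With placeholder numbers, `resolve({"high": 2.0, "medium": 1.0}, date(2025, 1, 1), date(2025, 1, 10))` picks the "high" score, while an empty `scores` dict gives "pending" inside the 30 days and "judgment" after. Note that the sketch does not answer the question of which leaderboard entries count as "o3" in the first place.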
@Frankas It's unclear whether this will be the canonical o3 score on HLE (e.g. is tool use fair game? Is there any pass@k thing happening under the hood?)
@JoshYou
> Resolution is based on the score given for o3 on https://agi.safe.ai/. If there are multiple scores (e.g. for "high" and "medium" reasoning), resolution is based on the highest score.
That seems to imply that as long as it's included on the site, it'll count? Though idk if it still counts if it's listed as "OpenAI Deep Research" rather than "o3 Deep Research" or something.