Kimi K3 Thinking METR 50% time horizon
Apr 14
<1h: 2%
1h - 1.5h: 4%
1.5h - 2h: 6%
2h - 2.5h: 18%
2.5h - 3h: 17%
3h - 3.5h: 16%
3.5h - 4h: 21%
4h - 4.5h: 5%
4.5h - 5h: 4%
5h - 5.5h: 3%
Other: 5%

This market resolves to the first 50% time horizon that METR reports for Moonshot AI's Kimi K3 Thinking. If METR evaluates a Kimi K3 family model that reasons before answering but does not have "Thinking" in its name (as Kimi K2 Thinking did), it still counts as Kimi K3 Thinking for the purposes of this market. Kimi K3 Code, Kimi K3 Heavy, and similar variants all count if they are the first such model that METR evaluates and reports on.

The 50% time horizon is a measure of AI autonomy based on the length of tasks that an AI can do: roughly, it is the time that humans take to complete tasks that an AI system can successfully do 50% of the time. See METR's "Measuring AI Ability to Complete Long Tasks" for the technical definition. When that paper was published, Claude 3.7 Sonnet, released in February 2025, was the leading model with a 50% horizon of 59 minutes.
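To make the definition concrete, here is a minimal sketch of how a 50% horizon could be read off a set of task results: fit a logistic curve of success against the log of human completion time, then solve for the time where the predicted success rate is 50%. The task times and outcomes are made up for illustration, and the function names are hypothetical. METR's actual pipeline has more to it (see their paper), so treat this as an illustration of the idea rather than their method.

```python
import math

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_50_minutes(task_minutes, successes):
    """Human task length (minutes) at which the fitted success rate is 50%."""
    xs = [math.log2(t) for t in task_minutes]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b * x) equals 0.5 exactly when a + b * x == 0.
    return 2.0 ** (-a / b)

# Hypothetical data: human completion time per task, and whether the model solved it.
task_minutes = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [1, 1, 1, 0, 1, 0, 0, 0]

h = horizon_50_minutes(task_minutes, successes)
print(f"Estimated 50% horizon: {h:.1f} minutes")
```

The toy data is deliberately symmetric around log2(time) = 3.5, so the fitted curve crosses 50% at about 2^3.5, or roughly 11.3 minutes.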

Left bounds inclusive, right bounds exclusive.
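As a sketch of what the bounds rule implies, the helper below maps a reported horizon in hours to one of the answer labels using half-open intervals [low, high). The function name is hypothetical, and sending everything at or above 5.5h to "Other" is my assumption based on the listed answers, not something the description states.

```python
def bucket_for(hours):
    """Return the answer label for a horizon in hours, using [low, high) intervals."""
    if hours < 1.0:
        return "<1h"
    if hours >= 5.5:
        # Assumption: anything past the last listed range falls under "Other".
        return "Other"
    # Listed ranges are half an hour wide and start at 1h.
    index = int((hours - 1.0) // 0.5)
    low = 1.0 + 0.5 * index
    return f"{low:g}h - {low + 0.5:g}h"

print(bucket_for(2.0))   # left bound is inclusive
print(bucket_for(3.49))  # just under a right bound stays in the lower range
```

Under this reading, a reported horizon of exactly 2h lands in "2h - 2.5h", not "1.5h - 2h".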

See also:

/jim/gpt-52-metr

/jim/claude-45-opuss-metr50-horizon (jim's version)

/Bayesian/claude-opus-45s-metr50-time-horizon (my version)

/Bayesian/gemini-3s-50-time-horizon-per-metr

/Bayesian/grok-420s-metr-50-time-horizon

/Bayesian/grok-5s-50-time-horizon-per-metr

/Bayesian/r2s-50-time-horizon-per-metr

/Bayesian/kimi-k3-thinkings-metr-50-time-hori (this market)

  • Update 2025-12-20 (PST) (AI summary of creator comment): If Kimi K3 is tested on METR with a subpar inference provider (similar to what happened with Kimi K2), the market will still resolve based on those results regardless of whether they may be unrepresentative of the model's true capabilities.

How does this market deal with poor inference providers? Kimi K2 was tested by METR with a subpar inference provider, resulting in (IMO) unrepresentative results. What if the METR K3 analysis falls short for the same reason?

@Lucac8a8 we still use it

I think METR has stopped operating just to spite you

@bens or Opus has a 30-day time horizon, so it takes them weeks to run its tests

© Manifold Markets, Inc.