I believe creating a model 90% as good as o4-mini is within the reach of a smart hobby researcher today.
Specifically, I believe it can be achieved using an open-source model of roughly the caliber available today as the base, clever scaffolding for agentic tool use and web search, and an affordable amount of GPU compute.
Specs:
If an LLM is used as the base, it must be open-weights and released during or before June 2025.
The base model must use fewer than 40B activated parameters if it is a mixture-of-experts model, or fewer than 80B parameters if it is dense.
Scaffolding/harness that lets the model search and run in a loop is allowed and encouraged. Anything goes as long as it is fully automated and contains no machine-learned components.
If compute is used for fine-tuning or reinforcement learning, its cost must not exceed $500, valued at the actual price paid or fair market value, whichever is higher.
"90% as good" is defined as a Cohen's d ≤ 0.32 between o4-mini's and the candidate model's task-wise scores across 5 runs of THUDM AgentBench.
If there are any competent, good-faith attempts (as judged by me), this market resolves YES if any of them satisfies all criteria, and NO otherwise. If there are no such attempts, this market resolves N/A.
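For concreteness, the resolution statistic can be computed with pooled-standard-deviation Cohen's d over the two models' task-wise score vectors. A minimal sketch (the score values below are purely illustrative, not real benchmark results):

```python
import statistics

def cohens_d(scores_a, scores_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(scores_a), len(scores_b)
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical per-run AgentBench scores over 5 runs (illustrative numbers only):
o4_mini_scores = [0.62, 0.58, 0.65, 0.60, 0.63]
candidate_scores = [0.57, 0.55, 0.61, 0.58, 0.60]
d = cohens_d(o4_mini_scores, candidate_scores)  # the criterion is d <= 0.32
```

Task-wise scoring would flatten per-task results across runs into the two vectors; the comparison itself is the same.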
Update 2026-05-03 (PST) (AI summary of creator comment): The base model must be open-weights and released during or before June 2025. Models like Qwen 3.5 (released after June 2025) do not qualify, even if they already meet the performance threshold without fine-tuning.
@Sss19971997 This question was meant to ask whether it can be done with harness engineering alone, plus possibly light post-training, using models of the time. So Qwen 3.5 would not satisfy this. When this question was posted, the open-weights models had very little agentic RL, if any.
For what it's worth, I think it's still possible. GRPO/DPO are much more memory- and sample-efficient than the older RL methods, even on an older base model, and many of the agent-harness patterns have been standardized by now.
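Part of the memory-efficiency claim is that GRPO drops PPO's learned value network in favor of a group-relative baseline: sample several completions per prompt, then normalize each reward against its group's mean and standard deviation. A minimal sketch of that advantage computation (function and variable names are illustrative, not from any particular library):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for completions sampled from one prompt.
    Each reward is normalized against the group mean/std, so no separate
    value network (and its extra set of weights) is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for 4 completions sampled from the same prompt (illustrative):
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
```

The saving matters on hobbyist hardware: PPO keeps policy, reference, and value models resident, while this baseline only needs the sampled rewards.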
@lumi You don’t think v3.1 is good enough?
I could also fine-tune, but what if the trajectories are created by, say, Opus 5 or GPT-6?