Will an LLM agent complete 50% of the lab tasks on the Factorio Learning Environment benchmark in 2025?

Question

Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.

https://jackhopkins.github.io/factorio-learning-environment/

As of the time of this market's creation, Claude 3.5 Sonnet tops the leaderboard with a success rate of 21.9%. This market resolves YES if the leaderboard displays an entry with >50% success before the end of 2025, NO otherwise. Claims of higher success rates won't count for resolution unless they're displayed on the leaderboard. Autonomous systems that are not based on an LLM agent framework also won't count for resolution.

Update 2025-12-31 (PST) (AI summary of creator comment): The creator has clarified that results shown on https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html will count for resolution purposes, even though the main leaderboard link has not been updated. The market will follow the spirit of the question rather than requiring the specific leaderboard page mentioned in the original description to be updated.

Manifold Markets · Accepted Answer

Yes. Resolved on Jan 1, 2026 by the Manifold Markets prediction market.

