Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations. We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, that tests agents in long-term planning, program synthesis, and resource optimization.
https://jackhopkins.github.io/factorio-learning-environment/
At the time of this market's creation, Claude 3.5 Sonnet tops the leaderboard with a success rate of 21.9%. This market resolves YES if the leaderboard displays an entry with a success rate above 50% before the end of 2025, and NO otherwise. Claims of higher success rates won't count for resolution unless they're displayed on the leaderboard. Autonomous systems that aren't based on an LLM agent framework won't count for resolution either.