Will language models be able to solve simple graphical mazes by the end of 2025?

Jan 1

51% chance

Take a very simple, classical maze. For instance a 5x5 square maze like the following one:

A human child should be able to solve it easily: starting from the opening at the top, the shortest solution is DLLDDRURDRDLDD, where each letter (D, U, L, R) represents the direction of one step (Down, Up, Left, Right).
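To make the answer format concrete, here is a minimal sketch of how such a move string could be turned into a sequence of grid positions. The function name `walk` and the (row, column) convention are assumptions made for illustration, and the sketch does not check for walls.

```python
# Hypothetical helper: follows a move string and records each position.
# Assumption: positions are (row, column), with rows growing downwards.
STEPS = {"D": (1, 0), "U": (-1, 0), "L": (0, -1), "R": (0, 1)}


def walk(moves, start=(0, 0)):
    """Return every position visited, including the starting one."""
    row, col = start
    visited = [(row, col)]
    for letter in moves:
        if letter not in STEPS:
            raise ValueError(f"unknown move: {letter!r}")
        d_row, d_col = STEPS[letter]
        row, col = row + d_row, col + d_col
        visited.append((row, col))
    return visited


positions = walk("DLLDDRURDRDLDD")
print(len(positions) - 1, positions[-1])  # number of steps and final offset
```

For the example solution this gives 14 steps and a net offset of 6 rows down and 0 columns sideways, which is what you would expect when entering above the top row of a 5x5 maze and leaving below the bottom row in the same column.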

Today's LLMs seem completely unable to solve this.

Will they be able to do it by the end of 2025?

Rules

This challenge is for general-purpose AIs, for instance language models (LLMs) that are not fine-tuned for mazes or similar problems.

Any textual prompt can be used to instruct the AI, but the same prompt needs to work with multiple different mazes.

The AI is not allowed to use external tools: it can only use the intrinsic features of the model itself.

The input maze should be generated by mazegenerator.net, with width and height set to 5 and default settings for everything else. We might change the generator if that site changes.

The maze should have only one valid path, and only the shortest path is a valid solution.
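For anyone checking answers by hand or by script, the reference shortest path can be found with a breadth-first search. The following is only a sketch under stated assumptions: the maze is described as a list of open passages between adjacent cells, which is a hypothetical representation and not something mazegenerator.net exports, and the names `shortest_path` and `passages` are made up for this example.

```python
from collections import deque

# Assumption: cells are (row, column) pairs and `passages` lists every pair
# of adjacent cells that has no wall between them.
STEPS = {"D": (1, 0), "U": (-1, 0), "L": (0, -1), "R": (0, 1)}


def shortest_path(passages, start, goal):
    """Breadth-first search; returns a move string, or None if unreachable."""
    open_edges = {frozenset(pair) for pair in passages}
    came_from = {start: None}  # cell -> (previous cell, letter used)
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for letter, (d_row, d_col) in STEPS.items():
            nxt = (cell[0] + d_row, cell[1] + d_col)
            if nxt not in came_from and frozenset((cell, nxt)) in open_edges:
                came_from[nxt] = (cell, letter)
                queue.append(nxt)
    if goal not in came_from:
        return None
    # Walk back from the goal to the start, collecting letters in reverse.
    moves = []
    cell = goal
    while came_from[cell] is not None:
        cell, letter = came_from[cell]
        moves.append(letter)
    return "".join(reversed(moves))


# Tiny 2x2 example: a U-shaped corridor from the top-left to the top-right.
tiny = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1))]
print(shortest_path(tiny, (0, 0), (0, 1)))
```

Breadth-first search explores cells in order of distance from the start, so the first route it finds to the goal is a shortest one. In a maze with a single valid path, that is also the only route that never backtracks.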

To pass the challenge, the AI needs to be able to solve at least 50% of the mazes it's provided. I or multiple reliable members of the community need to be able to verify the result, or the evidence of the performance must be trustworthy (e.g. a non-editable, non-falsifiable conversation, shared on the official platform of an AI, which shows the AI solving most of a hundred mazes).

It's possible to solve the problem in multiple turns, but the prompts must be fixed. For instance you can write "continue" each time the model stops outputting.

The number of characters and turns needs to be reasonable. Let's say a maximum of 50 turns and 100,000 characters.
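As a rough illustration of how the numeric rules could be applied to a batch of attempts, here is a small sketch. The `Attempt` record and the function names are hypothetical, and it assumes the character limit applies to the whole conversation for one maze, which the rules above do not spell out.

```python
from dataclasses import dataclass

MAX_TURNS = 50        # maximum turns per maze, from the rules
MAX_CHARS = 100_000   # maximum characters per maze (assumed: whole conversation)
PASS_RATE = 0.5       # at least 50% of the mazes must be solved


@dataclass
class Attempt:
    answer: str    # the one definitive move string given by the AI
    expected: str  # the shortest path for that maze
    turns: int     # turns used in the conversation
    chars: int     # characters used in the conversation


def counts_as_solved(attempt):
    """An attempt counts only if it is within limits and exactly right."""
    return (
        attempt.turns <= MAX_TURNS
        and attempt.chars <= MAX_CHARS
        and attempt.answer == attempt.expected
    )


def passes_challenge(attempts):
    """True if at least half of the attempted mazes count as solved."""
    if not attempts:
        return False
    solved = sum(counts_as_solved(a) for a in attempts)
    return solved / len(attempts) >= PASS_RATE
```

The exact string comparison reflects the rule that only the shortest path is a valid solution. Human verification and any judgment calls remain outside what a script like this can decide.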

The AI must provide only one definitive solution for each maze. It's allowed to output multiple tentative ones, as long as it declares them to be tentative.

In case of any controversy I will try to make fair and reasonable decisions using common sense and the community's input.

Resolution

This market will resolve YES if, before the end of 2025, some public-enough language model is shown to be able to correctly solve mazes like the example I offered at least 50% of the time.

It will resolve NO a few days into 2026 if no language model has been shown to be able to solve similar mazes by then.