Will an LLM consistently create 5x5 word squares by 2026?
83% chance

Given unlimited opportunities to fix its own mistakes within a single response, will an LLM be able to generate a 5x5 "word square," a square of letters where each row and column spells a word? It must succeed in at least 16 out of 20 trials, incorporating a random 5-letter word each time.

Example from Scott Alexander:

D E T E R
E X I L E
T I T A N
E L A T E
R E N E W

This one happens to be the same 5 words vertically and horizontally, but that isn't a requirement for this challenge.
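
To make the definition concrete, here is a minimal Python sketch of a checker. The function name `is_word_square` and the tiny word set are illustrative assumptions, not part of the challenge; a real check would use a full word list.

```python
def is_word_square(square, dictionary):
    """Return True if every row and every column of a 5x5 square is in `dictionary`.

    `square` is a list of 5 strings; spaces between letters are ignored.
    `dictionary` is a set of uppercase words (a stand-in for a real word list).
    """
    rows = [row.replace(" ", "").upper() for row in square]
    if len(rows) != 5 or any(len(row) != 5 for row in rows):
        return False
    # zip(*rows) walks the grid column by column.
    columns = ["".join(letters) for letters in zip(*rows)]
    return all(word in dictionary for word in rows + columns)


# The example square above, with a dictionary holding just its five words.
example = ["D E T E R", "E X I L E", "T I T A N", "E L A T E", "R E N E W"]
tiny_dictionary = {"DETER", "EXILE", "TITAN", "ELATE", "RENEW"}
print(is_word_square(example, tiny_dictionary))
```

Rows and columns are checked separately, so a square with different words across and down passes just as well as a symmetric one.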

Claude 3.5, GPT-4o, and Gemini Ultra are currently quite bad at this task. An attempt from GPT-4o: https://chatgpt.com/share/e0ec10b4-5852-41c6-926e-bf6b4049a7e8

Details

In January of 2026, I will generate 20 common 5-letter words using an LLM. For each word, I will send this prompt to the best LLM I have access to:

Create a 5x5 "word square" where each row and column spells a word. Afterwards, read off each of the rows and columns and determine whether they all make sense. If necessary, redo the square until it's correct. One of the words should be <WORD>.
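
As a rough sketch, the 20 prompts could be built by substituting each word into the template above. The helper name `build_prompt` and the sample words are hypothetical; the actual words will come from an LLM in January 2026.

```python
# The prompt text is copied from the market description; <WORD> is the placeholder.
PROMPT_TEMPLATE = (
    'Create a 5x5 "word square" where each row and column spells a word. '
    "Afterwards, read off each of the rows and columns and determine whether "
    "they all make sense. If necessary, redo the square until it's correct. "
    "One of the words should be <WORD>."
)


def build_prompt(word):
    """Fill the placeholder with a 5-letter word (helper name is an assumption)."""
    if len(word) != 5 or not word.isalpha():
        raise ValueError(f"expected a 5-letter word, got {word!r}")
    return PROMPT_TEMPLATE.replace("<WORD>", word)


# Hypothetical sample words, for illustration only.
for sample_word in ["titan", "house"]:
    print(build_prompt(sample_word))
```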

I will let the LLM give one response, which may be as long as it wants. I will look at the last word square that the LLM wrote down. I will count the trial as passing if it contains the given word and every word is in the Scrabble dictionary. If at least 16 out of 20 trials pass, this market will resolve YES, otherwise NO.
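
The resolution rule can be sketched in Python as follows. This assumes the last word square has already been read off the response by hand, and that `dictionary` is a set of uppercase words standing in for the Scrabble dictionary; the function names are illustrative.

```python
REQUIRED_PASSES = 16
TOTAL_TRIALS = 20


def trial_passes(square, required_word, dictionary):
    """A trial passes if the square contains the given word and all ten words are valid."""
    rows = [row.replace(" ", "").upper() for row in square]
    if len(rows) != 5 or any(len(row) != 5 for row in rows):
        return False
    columns = ["".join(letters) for letters in zip(*rows)]
    words = rows + columns
    # The given word may appear as either a row or a column.
    return required_word.upper() in words and all(w in dictionary for w in words)


def market_resolves_yes(trial_results):
    """`trial_results` is a list of 20 booleans, one per trial."""
    if len(trial_results) != TOTAL_TRIALS:
        raise ValueError(f"expected {TOTAL_TRIALS} trials, got {len(trial_results)}")
    return sum(trial_results) >= REQUIRED_PASSES
```

For example, 16 passes and 4 failures would resolve YES under this sketch, while 15 passes and 5 failures would resolve NO.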

Since the rules seem very clear-cut to me, I may bet in this market.


Will this resolve YES even if the LLM was explicitly fine-tuned on this problem?

No, it should be a basic instruction-tuned LLM.