By the end of 2024, will there be an LLM prompt that can reliably solve the NYT Connections puzzle?

Introduction

Connections is a unique, playful semantic game that changes each day. It occupies a fairly different space from most of the other games being effectively challenged by Large Language Models on Manifold and elsewhere, being at times both humorous and varyingly abstract. But it does rely entirely on a simple structure of English text, and only features sixteen terms at a time, with up to 3 failed guesses forgiven per day. If you're unfamiliar, play it for a few days!

I think Connections would make a good mini-benchmark of how much progress LLMs make in 2024. So, if a prompt and LLM combo is discovered and posted in this market, and folks are able to reproduce its success, I will resolve this Yes and it'll be a tiny blip on our AI timelines. I will need some obvious leeway for edge cases and clarifications as things progress, to prevent a dumb oversight from ruining the market. I will not be submitting to this market, but will bet since the resolution should be independently verifiable.

Standards

-The prompt must obey the fixed/general prompt rules from Mira's Sudoku market, excepting those parts that refer specifically to Sudoku and GPT-4.

-The information from a day's Connections puzzle may be fed all at once in any format to the LLM, and the pass/fail of each guess generated may be fed as a yes/no/one away as long as no other information is provided.

-The prompt must succeed on at least 16 out of 20 randomly selected Connections puzzles from the archive available here, or the best available archive at the time it is submitted.

-Successful replication must then occur across three more samples of 20 puzzles in a row, all of which start with a fresh instance and at least one of which is entered by a different human. This is both to verify the success, and to prevent a brute force fluke from fully automated models.

-Since unlike the Sudoku market this is not limited to GPT-4, any prompt of any format for any LLM that is released before the end of 2024 is legal, so long as it doesn't try to sneak in the solution or otherwise undermine the spirit of the market.
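
The pass criteria above can be sketched in code. This is only a rough illustration, not an official checker: `solves` is a hypothetical callback that plays one archived puzzle with a fresh LLM instance and reports whether it was solved, and the function names and `seed` parameter are assumptions. The requirement that at least one replication be entered by a different human is not modeled.

```python
import random
from typing import Callable, Sequence

SAMPLE_SIZE = 20      # puzzles per sample (from the Standards)
PASS_THRESHOLD = 16   # minimum solved puzzles per sample (from the Standards)
REPLICATIONS = 3      # further consecutive samples after the first success


def sample_passes(puzzle_ids: Sequence[int],
                  solves: Callable[[int], bool],
                  rng: random.Random) -> bool:
    """Draw one random sample of puzzles and check the 16-of-20 bar."""
    chosen = rng.sample(list(puzzle_ids), SAMPLE_SIZE)
    solved = sum(1 for pid in chosen if solves(pid))
    return solved >= PASS_THRESHOLD


def market_resolves_yes(puzzle_ids: Sequence[int],
                        solves: Callable[[int], bool],
                        seed: int = 0) -> bool:
    """Initial sample plus three replications must all pass in a row."""
    rng = random.Random(seed)
    return all(sample_passes(puzzle_ids, solves, rng)
               for _ in range(1 + REPLICATIONS))
```

Sampling without replacement inside each batch of 20 is an assumption here; the Standards only say "randomly selected".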


Plot twist

how's this tracking - any updates?

predicts YES

curious how the testing is going (if anyone is doing it with any regularity)? Some pretty niche and obscure categories recently so I'm curious if these are solved within a max 3-miss prompt. Any new developments?

cc: @snazzlePop

predicts NO

Fine-tuning gpt-3.5-turbo to learn to play "Connections": LINK

I tried this a couple of times with GPT-4 recently and it was pretty good. I think it is least good at getting the ones where it has to recognize things by their lesser known meanings (or as names), so a solution might want to focus on this. That said I’d be surprised if whatever is released this year (4.5 or 5) can’t do it.

@dominic This could be interpreted as a “faith in the new models” market, yeah.

A good question for the coming year: Will Connections become easier on average due to a higher percent of "pseudo-synonym" sets in the puzzles? Did earlier Connections have more meta variety? I don't know, but I get the impression it might be the trend.

predicts NO

Notably there are some Connections from particular days I suspect LLMs will have an easy time with (such as the infamous four-word title puzzle), so random selection robustness is crucial!

bought Ṁ10 of NO

This is a good test, nice idea

Is fine-tuning the LLM allowed?

predicts NO

@CDBiddulph Yes! Any process that doesn’t backdoor in the solutions, since there are a limited number of puzzles.

@Panfilo given that the archive is online now, how will you know if the solutions have been part of the training set?

@Tomoffer I will not, though using the prompt to direct the LLM to match the input with known solutions would be considered backdooring.

bought Ṁ3 of YES

I just gave it a go - I only got through two games before I ran out of GPT-4 quota.

https://connections.swellgarfo.com/nyt/204
It got two groupings easily on this one - the gardening group and the sounds-like group. Then it got stuck on the last two groups and went round in circles without any clue. Fail.

🟩🟩🟦🟦

🟩🟨🟨🟨

🟨🟨🟨🟨

🟪🟪🟪🟪

🟩🟩🟩🟦

🟩🟩🟦🟦

🟩🟩🟦🟩

🟩🟩🟦🟦

🟩🟩🟩🟦

🟩🟩🟦🟩

🟩🟩🟦🟦

🟩🟩🟩🟦

🟩🟩🟦🟦

🟩🟩🟦🟩

🟩🟩🟦🟦

🟩🟩🟦🟦

🟩🟩🟩🟦

🟩🟩🟩🟦

🟩🟩🟩🟩

🟦🟦🟦🟦

https://connections.swellgarfo.com/nyt/205
It completed this one with ease:

🟩🟩🟩🟩

🟨🟨🟨🟨

🟦🟪🟦🟪

🟦🟦🟦🟦

🟪🟪🟪🟪

Note: I did diverge from the stated rules slightly: I gave back a result of either "correct", "incorrect", or "one away", as the game UI does. I would suggest this as an amendment to the resolution criteria, as this is a pretty key piece of info to human solvers too.
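
For anyone scripting the feedback step, here is a minimal sketch of a grader that returns only those three results, plus a loop that stops at the fourth mistake (three failed guesses forgiven). The names `grade_guess`, `play`, and `next_guess` are hypothetical; `next_guess` stands in for whatever turns the feedback history into the LLM's next guess.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[Tuple[str, ...], str]]


def grade_guess(guess: Sequence[str], groups: Sequence[Sequence[str]]) -> str:
    """Return 'correct', 'one away', or 'incorrect' and nothing else."""
    guess_set = set(guess)
    if len(guess_set) != 4:
        raise ValueError("a guess must contain four distinct words")
    # Largest overlap between the guess and any unsolved group.
    best = max(len(guess_set & set(group)) for group in groups)
    if best == 4:
        return "correct"
    if best == 3:
        return "one away"
    return "incorrect"


def play(groups: Sequence[Sequence[str]],
         next_guess: Callable[[History], Sequence[str]],
         max_mistakes: int = 4) -> bool:
    """Play one puzzle; the solver sees only its guesses and their results."""
    remaining = [set(group) for group in groups]
    history: History = []
    mistakes = 0
    while remaining and mistakes < max_mistakes:
        guess = next_guess(history)
        result = grade_guess(guess, remaining)
        history.append((tuple(guess), result))
        if result == "correct":
            remaining.remove(set(guess))
        else:
            mistakes += 1
    return not remaining  # True only if every group was found
```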

predicts NO

@draaglom Fair update! I will add in One Away.

predicts NO

@draaglom That's hilarious! There's huge overlap IRL between players of NYT games and people who follow/participate in US TV game shows. When I solved this one WHEEL MILLIONAIRE PYRAMID PRICE just sprang out. Ha ha on you dumb bot. For now.

ETA: Categories that have thrown me are "iPhone Apps" (seriously I have to give up on Android), something to do with kids movies, couple sports categories, and "hip hop artists whose name starts with a letter, but without the letter."

predicts NO

@ClubmasterTransparent For comparison, today 1:02 no mistakes, understood the last 4. Saw a group of four, next several groups of "many", then one of more specific only four, taking them out produced the last two groups of four.

Is there a faster way? How long is the bot taking to grind through? I imagine it's grinding very fast through a lot of things that appear adjacently in its learning set to "broom" while also grinding through "shaft" and "pollen" and looking for intersections. This way takes a human forever.

It's the process I imagine some smartypantses programming into Excel, then predictive text and Tableau and chatbots, now faster chatbot with bigger dataset called AI. It's not the only way though. We don't understand all the different ways human problem-solving works. We can't even agree on taxonomy/naming conventions.

bought Ṁ10 of NO

Love this question. Some good solvers I know use an algorithmic approach: find a group of more than four, then identify which of them has another meaning and fits a different group. Should be programmable.
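
A rough sketch of how that might be programmed, assuming the hard part is already done: `candidates` is a hypothetical mapping from a category label to every word that could plausibly fit it, so some lists hold more than four words. The search then assigns each overloaded word to the one group where the whole grid still splits into four sets of four.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Optional, Set


def solve_from_candidates(
    words: Iterable[str],
    candidates: Dict[str, Set[str]],
) -> Optional[Dict[str, FrozenSet[str]]]:
    """Pick four words per category so that every word is used exactly once."""
    # Try the tightest categories first; they leave the fewest choices.
    labels = sorted(candidates, key=lambda label: len(candidates[label]))

    def search(remaining, idx, chosen):
        if not remaining:
            return chosen  # every word has been placed
        if idx == len(labels):
            return None    # words left over but no categories left
        label = labels[idx]
        pool = sorted(candidates[label] & remaining)
        for combo in combinations(pool, 4):
            found = search(remaining - set(combo), idx + 1,
                           {**chosen, label: frozenset(combo)})
            if found is not None:
                return found
        # The category may be a red herring, so also try skipping it.
        return search(remaining, idx + 1, chosen)

    return search(frozenset(words), 0, {})
```

With made-up candidates such as animals = {CAT, DOG, HORSE, COW, BAT} and baseball = {BAT, GLOVE, BASE, MITT}, the search has to give BAT to baseball, because that is the only split that leaves four animals. Producing good candidate lists in the first place is the part this sketch does not address.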

Me, I don't believe in intuition but what I do is what looks like intuition to an observer. I look at the whole grid at once, groups pop out, and when I count 4 and only 4 in a group, I enter it. My best time is 59 seconds for one where I made no mistakes and actually saw the common bond among the last four not just process of elimination. Usually more like 1:30 to 2:00 unless I have no hope of getting the last four because it's a pop culture reference I don't have. I don't know how to model that process as an algorithm but I would very much like to.

bought Ṁ11 of NO

I love this game and this would be a massive feat, especially since they're really good at choosing words with an irrelevant relation to one another to throw you off the less obvious/correct answer.

predicts NO

@shankypanky Often there are red herrings in the top left quadrant -- two adjacent words that mean something in context but that phrase isn't relevant to the solution. Like LITTLE PRINCE or MATCH BOX.

bought Ṁ22 of YES

@ClubmasterTransparent between this comment and your other one above you have me thinking about the different ways I approach Connections, actually. I'm more in your approach (though because I do the daily crossword, I can't help but read into the various meanings of words since it's fundamental there, which helps). there's a similarity between the two, I think, that over time you start to "get" how the puzzle creators think and the answers feel more apparent?
anyway I'm a big fan of word games and now I'm going to be approaching it with a different headspace when the next puzzle arrives so thanks for that.

predicts NO

@shankypanky aw thanks.
