By the end of 2024, will there be an LLM prompt that can reliably solve the NYT Connections puzzle?

143

10kṀ140k

resolved Jan 17

Resolved

YES

ALL

Introduction

Connections is a unique, playful semantic game that changes each day. It occupies a fairly different space than most of the other games being effectively challenged by Large Language Models on Manifold and elsewhere, being at times both humorous and varyingly abstract. But, it does rely entirely on a simple structure of English text, and only features sixteen terms at a time with up to 3 failed guesses forgiven per day. If you're unfamiliar, play it for a few days!

I think Connections would make a good mini-benchmark of how much progress LLMs make in 2024. So, if a prompt and LLM combo is discovered and posted in this market, and folks are able to reproduce its success, I will resolve this Yes and it'll be a tiny blip on our AI timelines. I will need some obvious leeway for edge cases and clarifications as things progress, to prevent a dumb oversight from ruining the market. I will not be submitting to this market, but will bet since the resolution should be independently verifiable.

Standards

-The prompt must obey the fixed/general prompt rules from Mira's Sudoku market, excepting those parts that refer specifically to Sudoku and GPT-4.

-The information from a day's Connections puzzle may be fed all at once in any format to the LLM, and the pass/fail of each guess generated may be fed as a yes/no/one away as long as no other information is provided.
-The prompt must succeed on at least 16 out of 20 randomly selected Connections puzzles from the archive available here, or the best available archive at the time it is submitted.

-Successful replication must then occur across three more samples of 20 puzzles in a row, all of which start with a fresh instance and at least one of which is entered by a different human. This is both to verify the success, and to prevent a brute force fluke from fully automated models.

-Since unlike the Sudoku market this is not limited to GPT-4, any prompt of any format for any LLM that is released before the end of 2024 is legal, so long as it doesn't try to sneak in the solution or otherwise undermine the spirit of the market.

Update 2024-12-12 (PST): - The LLM only needs to correctly group the 16 words into their respective groups of 4. It does not need to identify or name the category labels for each group. (AI summary of creator comment)

Update 2025-01-01 (PST) (AI summary of creator comment): - Independent verification: Success must be confirmed by multiple traders using separate instances of the LLM.
- Consistent prompt usage: The same prompt must be utilized across different users to achieve successful puzzle solving.
- Resolution timeline extension: Resolution may be delayed until the end of January to accommodate verification processes.

Technology

Technical AI Timelines

Get

1,000

to start trading!

🏅 Top traders

#	Name	Total profit
1		Ṁ4,236
2		Ṁ3,368
3		Ṁ1,398
4		Ṁ1,288
5		Ṁ925

People are also trading

Will an LLM consistently create 5x5 word squares by 2026?

84% chance

Will the best public LLM at the end of 2025 solve more than 5 of the first 10 Project Euler problems published in 2026?

75% chance

Will an LLM be able to solve the Self-Referential Aptitude Test before 2027?

85% chance

Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2026

64% chance

Will any LLM be able to consistently play Akinator correctly as the user by 2028?

79% chance

Will any LLM produce a reasonable poker simulation, as judged by Nate Silver, by the end of 2028?

Sort by:

o1 hasn't failed at any of the ~dozen I've tried over the past weeks (a handful of times it's made a reasonable enough first guess and then needs a second).
No particularly special prompting - literally just 'solve this NYT connections', and then if it fails, '[A, B, C, D] was correct. [W, X, Y, Z] was incorrect.'

https://mikehearn.notion.site/155c9175d23480bf9720cba20980f539?v=77fbc74b44bf4ccf9172cabe2b4db7b8 has it getting o1 pro 14/15 on the first attempt [and o1 got it on the second attempt for the one it missed when I tried, for 15/15]. He's "stopped tracking these because the results ended up pretty clear. o1 Pro is nearly perfect".

The archive above got taken down, but to get to 16, I just tested today's, which o1 got. This is obviously playing a bit fast and loose with 'random' (and the assumption that o1 pro >> o1; although I could quickly feed o1 the couple that it failed to ensure it does get them within the subsequent allowed trys if people doubt this), but at this point I'm 99%+ confident that o1 (let alone o1 pro/o1 high) would meet the bar for this market.

If push comes to shove I'm happy to feed in 16-20 randomly selected connections into o1 if someone's an NYT pro subscriber or whatever and can get their hands on them. But I'm curious if any no-holders are holding out there's a chance that o1 actually can't meet this?

@CalebW As a matter of principal, if you can hand off this combination to a second party to confirm it, I will resolve Yes. I personally haven't used o1 and haven't held in this market in some time.

@Panfilo This is not limited to Caleb, by the way. Any @traders who can verify that they and a separate instance run by a separate person can both get a net success with the same prompt using an LLM released in 2024 (or earlier). I see multiple people confidently using o1, but for the resolution criteria to be met, we do need that redundant success with the same prompt. Just a technicality, but an important one. I am willing to hold off resolution for all of January in case folks are busy, but the market will remain closed for consistency.

@Panfilo I just tried o1 with Connections (1/1/25 edition) and was impressed to see it perfectly solve the puzzle. But my prompt was a little weird. Could we maybe create a template for a prompt and then all try it on the same day (starting on 1/2/25 for example) and share our findings? Want to propose a simple template?

@Panfilo The question isn't whether "an LLM released in 2024 (or earlier)" can do this. It's whether a "prompt" can do this by the end of 2024.

Verification should only be happening using prompts known to have been written before, well, today — not LLMs released before today.

@jpoet If you need a standardized prompt that was written before January 1, here's one of my old chats: https://chatgpt.com/c/673fd241-f174-8010-830b-9a98eaafea80

```
Sort these words into 4 categories with 4 words each.

PLAY BAY STIR CHAIN TREE STREAM BARK RUN HOWL GARNISH AIR PYRAMID MUDDLE LADDER SNARL STRAIN
```

Here's a screenshot that proves that the chat was on November 21:

@jpoet Yes, the original description is still the standard, I was just prompting (lol) folks to actually show their work.

@jpoet @Panfilo to make my position on resolution clearer:
- given the evidence I laid out above and the current market price, I think this should presumptively resolve yes absent counterevidence
- i view myself as having planted the flag for the following prompting strategy: 'just 'solve this NYT connections [puzzle follows]', and then if it fails, '[A, B, C, D] was correct. [W, X, Y, Z] was incorrect.'' etc
- with this said, turns out I can access the NYT archive, so I just would need agreed upon dates to run

In the meantime, here's o1-mini oneshotting today's: https://chatgpt.com/share/6786a275-12d8-8009-907e-200f8189624a

@CalebW Well my position is still that we need a direct duplication and it sounds like it shouldn't be that tough now that o1 seems to have overshot the necessary intelligence. @CDBiddulph @jpoet @Jx Would any of you three volunteer to work with CalebW to resolve? I want to be crystal clear that I will be sticking to this standard and there are just over two weeks remaining to confirm it!

@Panfilo Sure. @CalebW If you have access to the NYT archive, could you send a pastebin or something containing a list of 20 puzzles? To make it effectively "random" without going too far back into the past (to avoid the possibility that the puzzles ended up in the training data), how about the puzzle every third day starting from today? January 14, January 11, January 8, etc. I'll post o1's responses after I run them all, so everyone can see that the puzzles are the correct ones. Does that sound reasonable to both of you?

@CDBiddulph Quick and dirty copy and paste: https://docs.google.com/document/d/1ZrqHh1gWRmwyE4GuTIZ8-chH2_S-Zf2xxs7NaX4a95A/edit?usp=sharing

Presumably ChatGPT w/ code interpreter or whatver can clean up and randomize them (out of the correct order) as needed 😁

ChatGPT - Word Categorization Puzzle

Shared via ChatGPT

Kind of interesting how o1 always begins its response with "Here's one neat way to..."

https://mikehearn.notion.site/155c9175d23480bf9720cba20980f539?v=77fbc74b44bf4ccf9172cabe2b4db7b8

bought Ṁ5,000 YES

@Panfilo sorry if this is answered somewhere else but tbc, this market is about whether the AI can place the 16 words in the right groups of 4, not whether they can guess the name of the categories on top of that?

@Bayesian No, they do not need to name the categories.

bought Ṁ2,300 YES

https://livebench.ai/

didn't it get trained on connections though?

@Bayesian You think it trained on the LiveBench ones? Maybe people could test the last few connections game to see

@Bayesian Yeah I guess it might have been trained on them

@EliLifland Looks like they're from April-May https://huggingface.co/datasets/livebench/language/viewer/default/test?q=connections

@Bayesian I tried it on today's game which it couldn't have been trained on and it got it using the simplest prompt possible (second message https://chatgpt.com/share/66e47309-63ec-8007-85af-324a7a9b3310 )

ChatGPT

ChatGPT helps you get answers, find inspiration and be more productive. It is free to use and easy to try. Just ask and ChatGPT can help with writing, learning, brainstorming and more.

sold Ṁ877 NO

@EliLifland I'm impressed!

@EliLifland I am bullish on the market overall but this particular model would have a hard time passing the test as written, since randomly selecting one of the puzzles from April-May would count as (unintentionally) backdooring in solutions.

bought Ṁ300 YES

The new o1 models would count here, right?

https://openai.com/index/learning-to-reason-with-llms/

@HakonEgsetHarnes If it can solve the connections puzzle, then yes

@Bayesian Cool. In my test it oneshots 4/5 from the archive given an example solution, so if it get's 3 attempts, some more context and a bit of prompt massaging, I think it will easily do this.

People are also trading

Will an LLM consistently create 5x5 word squares by 2026?

84% chance

Will the best public LLM at the end of 2025 solve more than 5 of the first 10 Project Euler problems published in 2026?

75% chance

Will an LLM be able to solve the Self-Referential Aptitude Test before 2027?

85% chance

Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2026

64% chance

Will any LLM be able to consistently play Akinator correctly as the user by 2028?

79% chance

Will any LLM produce a reasonable poker simulation, as judged by Nate Silver, by the end of 2028?

54% chance

Introduction

Standards

🏅 Top traders

People are also trading

People are also trading

Related questions