Introduction
Connections is a unique, playful semantic game that changes each day. It occupies a fairly different space from most of the other games being effectively challenged by Large Language Models on Manifold and elsewhere, being at times humorous and at times abstract to varying degrees. But it does rely entirely on a simple structure of English text, and it only features sixteen terms at a time, with up to three failed guesses forgiven per day. If you're unfamiliar, play it for a few days!
I think Connections would make a good mini-benchmark of how much progress LLMs make in 2024. So, if a prompt-and-LLM combo is discovered and posted in this market, and folks are able to reproduce its success, I will resolve this Yes and it'll be a tiny blip on our AI timelines. I will need some obvious leeway for edge cases and clarifications as things progress, to prevent a dumb oversight from ruining the market. I will not be submitting to this market, but I will bet, since the resolution should be independently verifiable.
Standards
-The prompt must obey the fixed/general prompt rules from Mira's Sudoku market, excepting those parts that refer specifically to Sudoku and GPT-4.
-The information from a day's Connections puzzle may be fed all at once, in any format, to the LLM, and the pass/fail of each generated guess may be fed back as yes/no/one-away, so long as no other information is provided (see the sketch after this list).
-The prompt must succeed on at least 16 out of 20 randomly selected Connections puzzles from the archive available here, or the best available archive at the time it is submitted.
-Successful replication must then occur across three more samples of 20 puzzles in a row, all of which start with a fresh instance and at least one of which is entered by a different human. This is both to verify the success and to prevent a brute-force fluke from fully automated models.
-Since unlike the Sudoku market this is not limited to GPT-4, any prompt of any format for any LLM that is released before the end of 2024 is legal, so long as it doesn't try to sneak in the solution or otherwise undermine the spirit of the market.
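To make the feedback and sampling rules concrete, here is a minimal Python sketch of a compliant test harness. Everything in it is illustrative: `query_llm` is a hypothetical stand-in for whatever prompt-and-model combo is being tested, and the puzzle format is assumed.

```python
import random

def feedback(guess: set[str], groups: list[set[str]]) -> str:
    """The only information the harness may reveal about a guess."""
    if guess in groups:
        return "yes"
    if any(len(guess & g) == 3 for g in groups):
        return "one away"
    return "no"

def play_puzzle(puzzle: dict, query_llm, max_mistakes: int = 3) -> bool:
    """Run one puzzle; up to max_mistakes failed guesses are forgiven,
    and the game ends on the next failure after that."""
    groups = [set(g) for g in puzzle["groups"]]
    solved, mistakes, history = [], 0, []
    while len(solved) < 4 and mistakes <= max_mistakes:
        guess = set(query_llm(puzzle["words"], history))  # model returns four words
        result = feedback(guess, groups)
        history.append((sorted(guess), result))
        if result == "yes":
            solved.append(guess)
        else:
            mistakes += 1
    return len(solved) == 4

def run_sample(archive: list[dict], query_llm, n: int = 20, threshold: int = 16) -> bool:
    """One sample per the standards above: 20 random puzzles, at least 16 solved."""
    wins = sum(play_puzzle(p, query_llm) for p in random.sample(archive, n))
    return wins >= threshold
```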
@Bayesian You think it trained on the LiveBench ones? Maybe people could test the last few Connections games to see
@EliLifland Looks like they're from April-May https://huggingface.co/datasets/livebench/language/viewer/default/test?q=connections
@Bayesian I tried it on today's game which it couldn't have been trained on and it got it using the simplest prompt possible (second message https://chatgpt.com/share/66e47309-63ec-8007-85af-324a7a9b3310 )
@EliLifland I am bullish on the market overall but this particular model would have a hard time passing the test as written, since randomly selecting one of the puzzles from April-May would count as (unintentionally) backdooring in solutions.
The new o1 models would count here, right?
https://openai.com/index/learning-to-reason-with-llms/
@Bayesian Cool. In my test it one-shots 4/5 from the archive given an example solution, so if it gets 3 attempts, some more context, and a bit of prompt massaging, I think it will easily do this.
Claude 3.5 Sonnet with this system prompt:
You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
And this prompt:
Solve today's NYT Connections game. Here are the instructions for how to play this game: Find groups of four items that share something in common.

Category Examples:
FISH: Bass, Flounder, Salmon, Trout
FIRE _: Ant, Drill, Island, Opal
Categories will always be more specific than '5-LETTER-WORDS', 'NAMES', or 'VERBS'.

Example 1:
Words: ['DART', 'HEM', 'PLEAT', 'SEAM', 'CAN', 'CURE', 'DRY', 'FREEZE', 'BITE', 'EDGE', 'PUNCH', 'SPICE', 'CONDO', 'HAW', 'HERO', 'LOO']
Groupings:
1. Things to sew: ['DART', 'HEM', 'PLEAT', 'SEAM']
2. Ways to preserve food: ['CAN', 'CURE', 'DRY', 'FREEZE']
3. Sharp quality: ['BITE', 'EDGE', 'PUNCH', 'SPICE']
4. Birds minus last letter: ['CONDO', 'HAW', 'HERO', 'LOO']

Example 2:
Words: ['COLLECTIVE', 'COMMON', 'JOINT', 'MUTUAL', 'CLEAR', 'DRAIN', 'EMPTY', 'FLUSH', 'CIGARETTE', 'PENCIL', 'TICKET', 'TOE', 'AMERICAN', 'FEVER', 'LUCID', 'PIPE']
Groupings:
1. Shared: ['COLLECTIVE', 'COMMON', 'JOINT', 'MUTUAL']
2. Rid of contents: ['CLEAR', 'DRAIN', 'EMPTY', 'FLUSH']
3. Associated with "stub": ['CIGARETTE', 'PENCIL', 'TICKET', 'TOE']
4. Dream: ['AMERICAN', 'FEVER', 'LUCID', 'PIPE']

Example 3:
Words: ['HANGAR', 'RUNWAY', 'TARMAC', 'TERMINAL', 'ACTION', 'CLAIM', 'COMPLAINT', 'LAWSUIT', 'BEANBAG', 'CLUB', 'RING', 'TORCH', 'FOXGLOVE', 'GUMSHOE', 'TURNCOAT', 'WINDSOCK']
Groupings:
1. Parts of an airport: ['HANGAR', 'RUNWAY', 'TARMAC', 'TERMINAL']
2. Legal terms: ['ACTION', 'CLAIM', 'COMPLAINT', 'LAWSUIT']
3. Things a juggler juggles: ['BEANBAG', 'CLUB', 'RING', 'TORCH']
4. Words ending in clothing: ['FOXGLOVE', 'GUMSHOE', 'TURNCOAT', 'WINDSOCK']

Categories share commonalities:
• There are 4 categories of 4 words each
• Every word will be in only 1 category
• One word will never be in two categories
• As the category number increases, the connections between the words and their category become more obscure. Category 1 is the easiest and most intuitive, and Category 4 is the hardest
• There may be red herrings (words that seem to belong together but actually are in separate categories)
• Category 4 often contains compound words with a common prefix or suffix word
• A few other common categories include word and letter patterns, pop culture clues (such as music and movie titles), and fill-in-the-blank phrases

Often categories will use a less common meaning or aspect of words to make the puzzle more challenging. Often, three related words will be provided as a red herring. If one word in a category feels like a long shot, the category is probably wrong.

You will be given a new example (Example 4) with today's list of words. Remember that the same word cannot be repeated across multiple categories, and you will ultimately need to output 4 categories with 4 distinct words each. Start by choosing just four words to form one group. I will give you feedback on whether you correctly guessed one of the groups, then you can proceed from there.
Here is the word list: [<Word list goes here>]
Seems to perform quite solidly on this.
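For anyone who wants to reproduce this, here's roughly how that prompt could be wired into the Anthropic Python SDK. This is a sketch only: the exact model identifier and the `first_guess` helper are my assumptions, not part of the original test.

```python
import anthropic

# Paste the full system prompt and game prompt from above into these constants.
SYSTEM_PROMPT = "You are a world-class AI system, capable of complex reasoning and reflection. ..."
GAME_PROMPT = "Solve today's NYT Connections game. ..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def first_guess(words: list[str]) -> str:
    """Send the prompt plus today's word list and return the model's first guess."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed; use whichever Sonnet 3.5 build is current
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"{GAME_PROMPT}\nHere is the word list: {words}",
        }],
    )
    return response.content[0].text
```

Subsequent turns would just append the model's guess and the yes/no/one-away feedback to the `messages` list and call `create` again.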
@AardvarkSnake This is promising! A different approach than some of what we've seen before. Do you want to make an attempt? I can select the random sample or you can, and if it succeeds we can have a third party verify it. Or you can wait to try to make more mana, especially if you do the test privately and it already works 🥳
@dominic
https://connections.swellgarfo.com/nyt/444
🟨🟨🟨🟨
🟩🟪🟦🟩
🟪🟪🟦🟦
🟩🟪🟩🟩
🟪🟩🟩🟩
🟩🟪🟩🟩
🟩🟩🟩🟩
🟦🟦🟦🟦
🟪🟪🟪🟪
https://connections.swellgarfo.com/nyt/445
🟩🟦🟨🟪
🟦🟦🟪🟦
🟦🟪🟦🟪
🟩🟩🟩🟩
🟨🟨🟨🟨
🟦🟦🟪🟦
🟦🟦🟦🟪
🟦🟦🟦🟪
🟦🟪🟦🟪
🟦🟦🟪🟪
🟦🟪🟦🟪
🟦🟦🟪🟪
🟦🟦🟪🟪
🟦🟦🟦🟪
🟦🟦🟦🟪
🟦🟪🟦🟪
🟦🟪🟦🟪
🟦🟦🟪🟪
🟦🟦🟦🟪
🟦🟦🟦🟦
🟪🟪🟪🟪
Observations: Sonnet was substantially worse than 4o in my opinion in two ways:
1. It didn't track which words it had already used, resulting in many illegal guesses (not captured in the logs above). gpt-4o also does this a bit, but much less.
2. It didn't recognise for itself when it had 'won' the game and just kept on guessing (gpt-4o recognises that it has won once it gets four groups correct, without you having to explicitly tell it). A thin validation layer around the model would catch both failure modes; see the sketch below.
(I used the same prompt for both)
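A minimal sketch of such a layer, assuming a simple game-state representation (the function names and signatures are mine, not part of either model's actual harness):

```python
def validate_guess(guess: list[str], remaining: set[str]) -> str | None:
    """Return an error message for an illegal guess, or None if it is legal."""
    if len(set(guess)) != 4:
        return "a guess must be four distinct words"
    bad = [w for w in guess if w not in remaining]
    if bad:
        return f"illegal words (already solved or not in the puzzle): {bad}"
    return None

def game_over(solved_groups: list[set[str]], mistakes: int, max_mistakes: int = 3) -> str | None:
    """Tell the model explicitly when the game has ended, win or lose."""
    if len(solved_groups) == 4:
        return "won"
    if mistakes > max_mistakes:
        return "lost"
    return None
```

Whether relaying the illegal-guess error back to the model counts as "other information" under the standards is the kind of edge case the market creator would need to rule on.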
@draaglom Interesting that it is so much worse. I wonder if the right prompt could get it to be better - it does have decent reasoning capability in other ways.
@dominic Part of what attracted me to this as a benchmark is that Connections is already trying to trick humans' ability to predict a pattern, rather than merely challenge it directly. The most conventionally likely next token being wrong or 50/50 ambiguous is kind of a core element of the puzzle (at least on difficult days)!
@Panfilo True, but it's also a puzzle all about guessing which words are related to other words, something that LLMs are generally decent at. For example, the Sudoku market required a super long and convoluted prompt because solving Sudoku requires a lot of explicit logical reasoning steps. Connections requires far less of that, and the difficulty comes mostly from figuring out which words could be related (though there is some difficulty from the overlap of categories as well).
@traders For those not following me and thus perhaps out of the loop on recent action, this market has been upgraded to PLUS!
Could you link me to where its category of AI is described? Is it considered some other branch of machine learning? If Strawberry allows a prompt to be entered to an LLM that beats the market’s standards, that would obviously count for a Yes. Is it expected to work on a totally different input system? Can it work without an LLM? Do we know any of these answers yet?
I don't know the specifics, but apparently it does something like multi-step reasoning by learning to search the space of possible multi-step plans. It uses an LLM as a tool but possibly isn't itself an LLM. My understanding of even the rumors is very surface-level, so take that with a massive grain of salt, but there is potential for some ambiguity there, I'd guess.
Like, maybe there's a model B that uses model A as a tool/world-model sampler, where model B does the multi-step planning, and the advanced capabilities like solving Connections come from joining the two. Then it isn't completely clear whether it counts as within the category of the first model or not. I guess you could define what you mean by LLM, so people who have different estimates about how much of an LLM Strawberry might be can bet on that. Up to you, though, and if someone knows more than me, please correct me.
@TimothyJohnson5c16 I previously said that fine-tuning in general is okay, but backdooring in answers under the pretext of fine-tuning is not (i.e., having some sort of behavior where the LLM recognizes one of X extant grids and spits out the sets associated with that grid because it was trained to recognize them as a backlog of paired wholes). Notably, GPT-3.5 Turbo did not come close to meeting the resolution criteria after being fine-tuned on Connections, as linked in a thread below.