This market predicts whether GPT-4 will be able to solve "easy" Sudoku puzzles by December 31, 2023.
Resolution Criteria
Resolves YES if:
A fixed prompt is found (and posted in the comments) that enables GPT-4 to reliably solve freshly-generated easy-rated Sudoku puzzles from the LA Times Sudoku page (latimes.com), using only its language modeling capabilities and context as memory.
Resolves 50% if:
A fixed prompt is found (and posted in the comments) that enables GPT-4 to occasionally solve easy-rated Sudoku puzzles under the specified conditions.
Resolves NO if:
No fixed prompt that enables GPT-4 to even occasionally solve easy-rated Sudoku puzzles using the specified conditions is posted in the comments by December 31, 2023.
OpenAI permanently shuts down GPT-4 access before any solutions are posted in the comments.
Resolves as NA if:
This market does not resolve NA.
Resources
Discord server: https://discord.gg/Y6qvtB5xPD
Github repository with solution judging script: https://github.com/Mira-public/manifold-sudoku
Manifold category for related markets: https://manifold.markets/questions?topic=gpt4-sudoku-challenge-2023
Definitions
GPT-4 refers to either ChatGPT's GPT-4 or any model named like GPT-4 exposed in OpenAI's API. "gpt-4-base", "gpt-4", and "gpt-4-32k" are currently-known model IDs, but anything labeled GPT-4 would count, including the upcoming image support. The API is preferable, since setting temperature to 0 allows the judge to replicate your responses, but ChatGPT could also be accepted if your prompt has a high success rate. See the definitions of "reliably" and "occasionally" below for details on computing the success rate if more precision is needed. The model must be produced by OpenAI, so finetuned variants would not count.
See "Related markets" below for variants that allow GPT-3.5, finetuned models, and that only need to solve a single puzzle.
Easy-rated Sudoku puzzle means a puzzle classified as easy by any reputable Sudoku site or puzzle generator. This market plans to use the LA Times Sudoku page (latimes.com) for judging, but I maintain the option to use a different Sudoku generator.
Fixed-prompt means that everything provided to GPT-4, except the Sudoku puzzle itself, remains the same. The prompt may give GPT-4 instructions, but those instructions must not change from puzzle to puzzle. A solution must be found within 50 turns. Multimodal support may be used. The operator cannot give GPT-4 any information beyond the initial puzzle, so their inputs must be static (e.g. just saying "continue" if ChatGPT runs out of output space and stops).
Formal definition of Solution
A Sudoku Template is any string with exactly 81 substitution points. Such a template can be combined with 81 digits 1-9 or a Placeholder value to produce a Rendered Sudoku. The placeholder can be any string - including "0", ".", or "_" - but must be a specific string and identical each time. The substitution points do not need to be in any specific order: an inverted or flipped puzzle would also be allowed by using a template with substitutions in inverted or flipped order.
An image rendering of the initial puzzle would also be a valid Rendered Sudoku.
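For illustration, here is a minimal Python sketch of one possible template and rendering (the "{}" substitution points, the grid layout, and the "." placeholder are choices made for this sketch, not a required format):

# A template with exactly 81 substitution points, written here as "{}".
ROW = "{} {} {} | {} {} {} | {} {} {}"
SEP = "------+-------+------"
TEMPLATE = "\n".join([ROW, ROW, ROW, SEP, ROW, ROW, ROW, SEP, ROW, ROW, ROW])

PLACEHOLDER = "."  # one specific string, used identically for every blank cell

def render_sudoku(puzzle81: str) -> str:
    """Render an 81-character puzzle string (digits, with '0' marking blanks)."""
    assert len(puzzle81) == 81
    cells = [c if c != "0" else PLACEHOLDER for c in puzzle81]
    return TEMPLATE.format(*cells)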
Chat Completion API entry is a pair (tag, message), where tag is one of "system", "user", "assistant", and message is any UTF-8 string. When multimodal GPT-4 is released, message can also be an image.
A Turn is a pair (entries, response), where entries is a list of Chat Completion API entries and response is the UTF-8 encoded string that GPT-4 generates.
A Transition Rule maps one list of entries to another list of entries, using the primitive operations:
Remove an entry at a fixed index (from beginning or end).
Insert a fixed message at a fixed index (from beginning or end).
Insert a rendered Sudoku created from the initial Sudoku puzzle at a fixed index (from beginning or end). The fixed prompt is allowed to contain multiple renderings of the same puzzle.
Insert the GPT-4 response to the input entry list at any fixed index (from beginning or end). You can use either the default GPT-4 response length (i.e. whenever it emits an <|im_end|> token), or you can specify an exact token count up to the native context size of the model. It is allowed to make multiple API requests, and to retry requests that respond with errors, as long as the successful requests are all unconditionally concatenated into a single response and the inputs + response fit within the model's context (see the sketch after this list). You cannot apply any other transition rules until the entire response is generated. The "output tokens" limit of the OpenAI API doesn't matter - only the context size; so the 128k GPT-4 Turbo can be chunked to produce either a fixed number of tokens, or the model can choose to stop at any point up to 128k.
Example: You have 2,000 tokens of input and are using the 32k model. If you specify "32,000" as your size here, you're allowed to keep querying the API sending the entire context + all previous responses until you get exactly 30,000 tokens of output. These should all be concatenated into a single entry.
Example: You're using GPT-4 Turbo which is a 128k context model, and have a 12k prompt. Using the "finish_reason" in the API response, the model would be allowed to generate up to 116k tokens using the maximum 4k output tokens each time.
The GPT-4 response can be tagged "user", "assistant", or "system" when later resent to GPT-4, as long as this choice doesn't depend on the message.
Truncate an entry at a fixed token index (the index is from beginning or end, and truncation can start from beginning or end). You can use characters for testing, but judging will use "cl100k_base" tokens.
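As referenced in the response-insertion rule above, here is a rough sketch of concatenating multiple API requests into a single response. It assumes the openai Python client (v1 style) and a GPT-4 Turbo model ID; whether resending the partial output as a trailing assistant entry makes the model continue cleanly is an assumption of this sketch, not a guarantee of the API, and the judging script on Github remains authoritative.

from openai import OpenAI  # assumes the openai Python package, v1+ client style

client = OpenAI()

def generate_full_response(entries, model="gpt-4-1106-preview", chunk_tokens=4096):
    """Assemble one logical GPT-4 response from multiple chunked API calls.

    entries: list of {"role": ..., "content": ...} dicts (the input entry list).
    The entire context plus all output so far is resent each time, and the chunks
    are unconditionally concatenated until the model stops on its own.
    """
    full_response = ""
    while True:
        messages = list(entries)
        if full_response:
            messages.append({"role": "assistant", "content": full_response})
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,           # judging uses temperature 0
            max_tokens=chunk_tokens,
        )
        choice = resp.choices[0]
        full_response += choice.message.content or ""
        if choice.finish_reason == "stop":  # the model chose to end its output
            return full_response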
A Fixed-prompt is any sequence of transition rules.
The Operator is the human or program that is executing a fixed-prompt against the OpenAI API.
Then a Solution for the purposes of this market is a fixed-prompt satisfying all of:
"initial Sudoku puzzle" is bound to a specific rendered Sudoku.
The transition rules are applied for 50 turns to get a maximum of 50 GPT-4 responses.
The operator, scanning those responses for the first thing that subjectively looks like a solved Sudoku puzzle and then stopping, is able to input that candidate into a Sudoku checking tool and confirm that it is a solution to the initial Sudoku puzzle.
"Subjectively looks like" refers to parsing a puzzle from a string into a normal form, and is approximately "turn number-dependent regular expression with named capture groups". I choose not to specify it because I'm not 100% sure what regex generalizations allow useful compute and want to retain the possibility of rejecting them, or of accepting isomorphic puzzle solves.
Examples
The simplest valid pattern is:
("User", <some initial prompt>)
("User", <provide puzzle>)
("Assistant", response 0)
("User", "continue") ;; or any other fixed input
("Assistant", response 1)
("User", "continue")
....
("User", "continue")
("Assistant", solution)
With at most 50 "Assistant" entries (50 turns). The only "dynamic" input here is entry #2, which contains the puzzle; the rest is ChatGPT's responses. So this counts as a "fixed prompt" solution. You're allowed to insert more prompts into the chain after the puzzle, as long as neither the decision to include them nor their contents depends on the puzzle. For example, you might have a prompt that causes ChatGPT to expand the puzzle into a set of logical constraints. You're allowed to drop sections from the chain when sending context to GPT-4, as long as the decision to drop does not depend on the contents of any section.
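A rough sketch of driving this simplest pattern through the Chat Completions API (the openai Python client is assumed; "<some initial prompt>" is left as a placeholder for whatever fixed instructions you choose, and the Github script above is what will actually be used for judging):

from openai import OpenAI  # assumes the openai Python package, v1+ client style

client = OpenAI()

INSTRUCTIONS = "<some initial prompt>"  # fixed text, identical for every puzzle

def run_simple_pattern(rendered_puzzle: str, max_turns: int = 50) -> list[str]:
    """Run the simplest valid pattern: the puzzle once, then a static "continue" each turn."""
    messages = [
        {"role": "user", "content": INSTRUCTIONS},
        {"role": "user", "content": rendered_puzzle},  # the only dynamic entry
    ]
    responses = []
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="gpt-4", messages=messages, temperature=0)
        reply = resp.choices[0].message.content or ""
        responses.append(reply)
        # The operator would stop here as soon as a response looks like a solved grid.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": "continue"})  # static operator input
    return responses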
Candidate solutions will be converted to code and run using a script (Mira-public/manifold-sudoku on Github). You are not required to interact with this script when submitting a solution, but @Mira will attempt to use it to judge your solution, so it may help in understanding the format.
Language modeling capabilities means that GPT-4 is not allowed to use any external tools, plugins, recursive invocations, or resources to aid in solving the Sudoku puzzle. It must rely solely on its language modeling capabilities and the context provided within the prompt. This is less relevant when using the API or Playground, and more relevant to using ChatGPT.
Reliably means the prompt succeeds at least 80% of the time on freshly-generated puzzles. Occasionally means the prompt succeeds at least 20% of the time on freshly-generated puzzles. I will run any proposed solution against 5 puzzles, with more testing to be done if it succeeds at least once or if there is disagreement in the comments about whether it meets a threshold (perhaps I got very unlucky). More testing means choosing a fixed pool of puzzles and calculating an exact percentage. I currently plan to choose "all easy-rated Sudoku puzzles in January 2024 from the LA Times" as my pool. Since judging solutions requires me spending real money on API calls, I may optionally require collateral to be posted: $10 of mana (Ṁ1,000) for quick validation, and $100 of mana (Ṁ10,000) for extended validation. Collateral will be posted as a subsidy to an unlisted market that resolves NA if the candidate passes testing, or will be collected equal to Mira's API costs if not. Anyone can post collateral for a candidate, not just the submitter. Detailed testing will be done with the API set to temperature 0, not ChatGPT.
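In code terms, the thresholds amount to something like the following sketch (the solve count and pool come from the extended testing described above; the function name is my own):

def classify(solves: int, pool_size: int) -> str:
    """Map an exact solve percentage over the fixed puzzle pool to the resolution thresholds."""
    rate = solves / pool_size
    if rate >= 0.80:
        return "reliably -> YES"
    if rate >= 0.20:
        return "occasionally -> 50%"
    return "NO"

# e.g. classify(4, 5) -> "reliably -> YES"; classify(1, 5) -> "occasionally -> 50%"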
@Mira as market creator will trade in this market, but commits not to post any solution, or to provide prompts or detailed prompting techniques to other individuals. So if it resolves YES or 50%, it must be the work of somebody other than Mira.
Example Puzzles
From the New York Times Sudoku page (nytimes.com), March 28, 2023, rated "Easy":
210000487
800302091
905071000
007590610
560003002
401600700
039007000
700100026
100065009
Solution:
213956487
876342591
945871263
327594618
568713942
491628735
639287154
754139826
182465379
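For concreteness, here is a minimal sketch of the kind of check the operator's "Sudoku checking tool" performs, applied to the example puzzle and solution above (my own illustration, not the judging script):

initial = ("210000487" "800302091" "905071000" "007590610" "560003002"
           "401600700" "039007000" "700100026" "100065009")
candidate = ("213956487" "876342591" "945871263" "327594618" "568713942"
             "491628735" "639287154" "754139826" "182465379")

def is_valid_solution(initial81: str, candidate81: str) -> bool:
    """Check an 81-digit candidate grid against the 81-digit initial puzzle ('0' = blank)."""
    if len(candidate81) != 81 or any(c not in "123456789" for c in candidate81):
        return False
    # Every given clue must be preserved.
    if any(i != "0" and i != c for i, c in zip(initial81, candidate81)):
        return False
    grid = [candidate81[r * 9:(r + 1) * 9] for r in range(9)]
    cols = ["".join(row[c] for row in grid) for c in range(9)]
    boxes = ["".join(grid[r + dr][c + dc] for dr in range(3) for dc in range(3))
             for r in (0, 3, 6) for c in (0, 3, 6)]
    return all(set(g) == set("123456789") for g in grid + cols + boxes)

print(is_valid_solution(initial, candidate))  # True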
Related Markets
Main market: /Mira/will-a-prompt-that-enables-gpt4-to
GPT-4 any puzzle: /Mira/m100-subsidy-will-gpt4-solve-any-fr-c5b090d547d1
GPT-3.5 any puzzle, no finetuning: /Mira/will-gpt35-solve-any-freshlygenerat
GPT-3.5 any puzzle, finetuning allowed: /Mira/will-finetuned-gpt35-solve-any-fres
Group including other related markets: https://manifold.markets/questions?topic=gpt4-sudoku-challenge-2023
Edit History
Mar 26, 2:53pm: Title changed from "Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" to "(M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
Mar 27 - Clarified that judging will use freshly-generated puzzles.
Mar 29 - Added example with Chat Completions API to help specify allowed prompts.
Apr 3 - Clarified that dropping Chat Completion API turns is allowed.
Apr 20 - Added a more formal description of the solution format.
Apr 21 - Candidate solutions must be posted in the comments before market close.
Apr 27, 6:43am: Title changed from "(M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" to "(M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
Apr 30, 1:57am: Title changed from "(M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" to "(M20000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
April 30, 2:57 am: Added that the percentage is defined against a fixed pool of puzzles, if it solves at least one in a preliminary test of 5.
April 30, 5:37 am: Judging will be done with the API. ChatGPT may be accepted if it has a high success rate, but if there's any debate I will use the API with temperature 0. New York Times is chosen as the presumptive source of Sudoku puzzles.
May 5, 2 pm: Link to script on Github, changed puzzle provider to LA Times.
May 7, 3 pm: Details on posting collateral for API costs.
July 16, 7:38 AM: @Mira conflict of interest commitment.
August 8, 2:45 PM: Input representation can be any 81-slot substitution string.
August 15: NO clause for if OpenAI shuts down.
August 23: Truncating a message is allowed.
August 28: You're allowed to make multiple OpenAI API calls to generate a single logical response, to work around limitations of their API.
September 22: Related markets; finetuning and GPT-3.5 aren't allowed.
November 13: "finish_reason" in the API allows the model to stop chunked outputs, so the 128k context GPT-4 is allowed to have a single chunked 128k output, not 4k like you might assume. Also added countdown timer by popular request.
@NicoleWilson Do you believe that it was resolved incorrectly? If the resolution criteria are sufficiently objective and comprehensive then I don't think there's a conflict of interest in betting on one's own market. (Sometimes even with criteria that seemed objective and comprehensive you do end up with a weird edge case, and in that situation a creator who's bet on their own market should probably get some outside opinions. But this resolution seems straightforward enough.)
@NicoleWilson The only market you ever bet in on this platform was one you resolved, with you being #1 in profit.
Where did the 0.5 come from?
Not even sure if I knew that was possible when I went short at 40% or whatever last August, seemed like a binary at the time...
Looks like market moved after the link was posted and then everyone got jazzed. But the solution only works like 40% of the time and you have to do engineering to use it(?).
To me that's a 0 / fail / halt but idk.
https://github.com/iamthemathgirlnow/sudoku-challenge-solution/tree/main
@BenjaminCosman my question is less was it resolved according to the criteria and more, did the criteria change along the way?
My memory sucks, but I don't remember a 0.5 being an option for 'it kinda gets some of them.'
Maybe there was and I should have read closer but the question is more did the criteria change as people got hyped and then wanted a compromise (which wasn't really a compromise).
@AlexanderLeCampbell The 0.5 option is very clear in the currently-visible resolution criteria (if you can't be bothered to read all of it, then Ctrl-F for "Resolves 50% if:" and then for "Occasionally means the prompt succeeds at least 20% of the time"). You can review the market history from the three-dots menu followed by the History button; I'm not going to check every version for you, but I can confirm that the 0.5 option existed both in the very first version and in the version during which I personally first bought in.
(And if you're wondering about my potential bias here: I was actually one of the top 20 NO holders, so 50% resolution instead of NO hurt. But it was the obviously-correct answer based on the stated criteria, sorry.)
@8 I outperformed 3135 people. That is odd. My behavior in this market was very stupid and irrational. I am happy I did not go bankrupt. I bought a lot of NO when it was really super high, which saved me. It's funny that no one won that much money on this market. And people in both the YES and NO camps ended up among the top profiteers and losers.
I've finished testing @EmilyThomas ' December 1 solution. It successfully solved 10/17 puzzles in the January pool of easy LA Times Sudoku puzzles before resolution was guaranteed, making the final solve percentage 59%. The other 14 days in January will not be run to save on API costs, since they cannot reach 80% solve rate.
Resources:
Emily's Prompt Submission on December 1, 2023, which was tested on the January pool
List of puzzles in the January pool
@EmilyThomas may be writing an article about the design of her Sudoku-solving prompt, and I may be writing one about this contest, the original motivations for setting it up, and the useful implications of prompts that can execute arbitrary algorithms.
I also plan to give this challenge a try, though my own attempts won't count for this contest.
Thanks everyone for participating!
@traders This market is now guaranteed to resolve 50% or YES. A NO resolution is no longer possible.
Please review my testing summary table and let me know if you find any errors in the test transcripts on Github that could change the result.
We are at 5 failures currently. 7 failures would preclude a YES resolution (the January pool has 31 puzzles, and 80% requires at least 25 solves, i.e. at most 6 failures), so Emily's prompt must solve the remaining January puzzles with at most 1 more failure to get a YES resolution. I will stop testing early and resolve to 50% when I see 2 more failures.
@bfdc End of 2023 is when the market closes for trading and the deadline for submissions. Any submissions still need to be tested, which in the description is specified to use the January pool of puzzles from the LA times.
Otherwise, how did you imagine a solution submitted on December 31 would have its solve percentage calculated, given the requirement for freshly-generated puzzles?
@EmilyThomas Interesting. Is there a reason to believe that the 73/92 success rate in October, November, December is maybe just a lucky fluke? I place the 85% confidence interval at 13%. I am more optimistic than 1 in 120 for a YES resolution possibility, so I have been hedging quite a bit against it since I hold so much NO in this market.
Well, 3 auto-fails in a row seems like a good reason to be long on any possibility of a YES resolution.
Oh yeah, I check the puzzle as soon as it comes out on the LA times site, giving me a few hours heads up on the derivative markets. I don't even have to use the simulation script, I just solve it using the same method as the prompt, and it either works or gets stuck.
...But apparently I posted the 1 in 120 value before today's puzzle came out?
Yeah, I have no idea how I came up with that number. You can get a lot of variability by using slightly different values for both the rate of eligible puzzles, and the rate of success on the eligible puzzles. I do not remember what values I used to get 1 in 120.
Even now, the odds of a YES range from 1 in 200 to 1 in 7, depending on what assumptions are made about the success rate on eligible puzzles. So I have no idea what the actual value is.