
This market predicts whether GPT-4 will be able to solve "easy" Sudoku puzzles by December 31, 2023.
Resolution Criteria
Resolves YES if:
A fixed prompt is found (and posted in the comments) that enables GPT-4 to reliably solve freshly generated easy-rated Sudoku puzzles from the LA Times Sudoku page (latimes.com), using only its language modeling capabilities and context as memory.
Resolves 50% if:
A fixed prompt is found (and posted in the comments) that enables GPT-4 to occasionally solve easy-rated Sudoku puzzles.
Resolves NO if:
No fixed prompt that enables GPT-4 to even occasionally solve easy-rated Sudoku puzzles under the specified conditions is posted in the comments by December 31, 2023.
OpenAI permanently shuts down GPT-4 access before any solutions are posted in the comments.
Resolves as NA if:
This market does not resolve NA.
Resources
Discord server: https://discord.gg/Y6qvtB5xPD
Github repository with solution judging script: https://github.com/Mira-public/manifold-sudoku
Manifold category for related markets: https://manifold.markets/questions?topic=gpt4-sudoku-challenge-2023
Definitions
GPT-4 refers to either ChatGPT's GPT-4, or any GPT-4 model available via the OpenAI Chat Completions API. "gpt-4" and "gpt-4-32k" are currently known model ids, but anything labeled GPT-4 would count, including the upcoming image support. The API is preferable, since setting temperature to 0 allows the judge to replicate your responses, but ChatGPT could also be accepted if your prompt has a high success rate. See the definitions of "reliably" and "occasionally" below for details on computing the success rate if more precision is needed. The model must be released by OpenAI, so finetuned variants would not count.
See "Related markets" below for variants that allow GPT-3.5, finetuned models, and that only need to solve a single puzzle.
Easy-rated Sudoku puzzle means a puzzle classified as easy by any reputable Sudoku site or puzzle generator. This market plans to use the LA Times Sudoku page (latimes.com) for judging, but I maintain the option to use a different Sudoku generator.
Fixed-prompt means that everything provided to GPT-4 except the Sudoku puzzle remains the same. The prompt may give GPT-4 instructions, but these instructions must not change from puzzle to puzzle. A solution must be found within 50 turns. Multimodal support is allowed to be used. The operator cannot give GPT-4 information beyond the initial puzzle, so their inputs must be static (e.g. just saying "continue" if ChatGPT runs out of output space and stops).
Formal definition of Solution
A Sudoku Template is any string with exactly 81 substitution points. Such a template can be combined with 81 values, each a digit 1-9 or a Placeholder, to produce a Rendered Sudoku. The placeholder can be any string, including "0", ".", or "_", but it must be a specific string and identical each time. The substitution points do not need to be in any specific order: an inverted or flipped puzzle would also be allowed, by using a template with substitutions in inverted or flipped order.
An image rendering of the initial puzzle would also be a valid Rendered Sudoku.
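The template-and-placeholder mechanism above can be sketched in Python. This is an illustrative helper, not part of the official judging script; the function and variable names are hypothetical:

```python
# Sketch of a Sudoku Template: a string with exactly 81 substitution
# points, filled with digits 1-9 or a fixed placeholder string.
# Illustrative only; the judging script may represent this differently.

def render_sudoku(template: str, cells: str, placeholder: str = ".") -> str:
    """Substitute 81 cell values (digits '1'-'9', with '0' meaning empty)
    into a template whose substitution points are marked '{}'."""
    assert template.count("{}") == 81, "template needs exactly 81 slots"
    assert len(cells) == 81
    values = [c if c != "0" else placeholder for c in cells]
    return template.format(*values)

# A simple row-per-line template: 81 slots, one row of 9 per line.
template = "\n".join("{}" * 9 for _ in range(9))
puzzle = ("210000487" "800302091" "905071000"
          "007590610" "560003002" "401600700"
          "039007000" "700100026" "100065009")
rendered = render_sudoku(template, puzzle)
```

A flipped or inverted template would simply place its 81 `{}` slots in a different order, which the definition above explicitly allows.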
A Chat Completion API entry is a pair (tag, message), where tag is one of "system", "user", or "assistant", and message is any UTF-8 string. When multimodal GPT-4 is released, message can also be an image.
A Turn is a pair (entries, response), where entries is a list of Chat Completion API entries and response is the UTF-8 encoded string that GPT-4 generates.
A Transition Rule maps one list of entries to another list of entries, using the primitive operations:
Remove the entry at a fixed index (from the beginning or end).
Insert a fixed message at a fixed index (from the beginning or end).
Insert a rendered Sudoku, created from the initial Sudoku puzzle, at a fixed index (from the beginning or end). The fixed prompt is allowed to contain multiple renderings of the same puzzle.
Insert the GPT-4 response to the input entry list at any fixed index (from the beginning or end). You can use either the default GPT-4 response length (i.e. whenever it emits an <|im_end|> token), or specify an exact token count up to the native context size of the model. It is allowed to make multiple API requests, and to retry requests that respond with errors, as long as the successful requests are all unconditionally concatenated into a single response and the inputs plus response fit within the model's context. You cannot apply any other transition rules until the entire response is generated.
Example: You have 2,000 tokens of input and are using the 32k model. If you specify "32,000" as your size here, you're allowed to keep querying the API sending the entire context + all previous responses until you get exactly 30,000 tokens of output. These should all be concatenated into a single entry.
Truncate an entry at a fixed token index (the index is counted from the beginning or end, and truncation can start from the beginning or end). You can use characters for testing, but judging will use "cl100k_base" tokens.
A Fixed-prompt is any sequence of transition rules.
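One way to picture a fixed-prompt is as a list of these primitive operations applied in order to the entry list. A minimal sketch follows; the names and the closure-based representation are illustrative, not how the judging script actually encodes rules:

```python
# Sketch: a fixed-prompt as a sequence of primitive transition rules
# operating on a list of (tag, message) entries. Illustrative only.

from typing import Callable, List, Tuple

Entry = Tuple[str, str]  # (tag, message); tag is "system"/"user"/"assistant"
Rule = Callable[[List[Entry]], List[Entry]]

def insert_fixed(index: int, entry: Entry) -> Rule:
    """Insert a fixed message at a fixed index."""
    def rule(entries: List[Entry]) -> List[Entry]:
        out = list(entries)
        out.insert(index, entry)
        return out
    return rule

def remove_at(index: int) -> Rule:
    """Remove the entry at a fixed index (negative = from the end)."""
    def rule(entries: List[Entry]) -> List[Entry]:
        out = list(entries)
        del out[index]
        return out
    return rule

def apply_fixed_prompt(rules: List[Rule], entries: List[Entry]) -> List[Entry]:
    """Apply each transition rule in sequence."""
    for r in rules:
        entries = r(entries)
    return entries

rules = [
    insert_fixed(0, ("user", "You are solving a Sudoku puzzle.")),
    insert_fixed(1, ("user", "<rendered puzzle goes here>")),
]
chain = apply_fixed_prompt(rules, [])
```

The key property the market requires is that the rule list itself is fixed: only the rendered puzzle inserted by a rule varies between runs.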
The Operator is the human or program that is executing a fixed-prompt against the OpenAI API.
Then a Solution for the purposes of this market is a fixed-prompt satisfying all of:
"initial Sudoku puzzle" is bound to a specific rendered Sudoku.
The transition rules are applied for 50 turns to get a maximum of 50 GPT-4 responses.
The operator scans those responses for the first thing that subjectively looks like a solved Sudoku grid, stops there, inputs it into a Sudoku checking tool, and confirms that it is a solution to the initial Sudoku puzzle.
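The final check above amounts to validating a completed grid against the initial clues. A rough sketch of such a checker (not the market's official tool; names are illustrative, and the example grid is the one from "Example Puzzles" below):

```python
# Sketch of a Sudoku checking tool: verifies that a candidate 81-digit
# string is a valid completed grid and preserves the initial clues
# (where '0' marks an empty cell). Not the official judging tool.

def is_solution(puzzle: str, candidate: str) -> bool:
    if len(candidate) != 81 or not set(candidate) <= set("123456789"):
        return False
    # Every given clue must be preserved.
    if any(p != "0" and p != c for p, c in zip(puzzle, candidate)):
        return False
    rows = [candidate[r * 9:(r + 1) * 9] for r in range(9)]
    cols = ["".join(row[c] for row in rows) for c in range(9)]
    boxes = ["".join(rows[r + dr][c + dc]
                     for dr in range(3) for dc in range(3))
             for r in (0, 3, 6) for c in (0, 3, 6)]
    # Each row, column, and 3x3 box must contain the digits 1-9 once.
    return all(set(unit) == set("123456789") for unit in rows + cols + boxes)

PUZZLE = ("210000487" "800302091" "905071000" "007590610" "560003002"
          "401600700" "039007000" "700100026" "100065009")
SOLVED = ("213956487" "876342591" "945871263" "327594618" "568713942"
          "491628735" "639287154" "754139826" "182465379")
```

Any standard checker with this behavior would serve the operator's purpose in step 3.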
Examples
The simplest valid pattern is:
("User", <some initial prompt>)
("User", <provide puzzle>)
("Assistant", response 0)
("User", "continue") ;; or any other fixed input
("Assistant", response 1)
("User", "continue")
....
("User", "continue")
("Assistant", solution)
With at most 50 "Assistant" entries (50 turns). The only "dynamic" input here is entry #2, which contains the puzzle; the rest is ChatGPT's responses. So this counts as a "fixed prompt" solution. You're allowed to insert more prompts into the chain after the puzzle, as long as neither the decision to include them nor their contents depends on the puzzle. For example, you might have a prompt that causes ChatGPT to expand the puzzle into a set of logical constraints. You're allowed to drop sections from the chain when sending context to GPT-4, as long as the decision to drop does not depend on the contents of any section.
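The simplest pattern above can be expressed as a plain message-list builder in the Chat Completions message format. This is a hypothetical helper for illustration; in real use, each turn's `messages` list would be sent to the API to obtain the next assistant response:

```python
# Sketch of the simplest valid pattern: a fixed preamble, the rendered
# puzzle, then alternating assistant responses and a fixed "continue"
# turn. Hypothetical helper; not the judging script.

def build_messages(preamble: str, rendered_puzzle: str,
                   responses: list) -> list:
    """Build a Chat Completions message list for the next turn."""
    messages = [
        {"role": "user", "content": preamble},
        {"role": "user", "content": rendered_puzzle},
    ]
    for r in responses:
        messages.append({"role": "assistant", "content": r})
        # The operator's only input after the puzzle is static.
        messages.append({"role": "user", "content": "continue"})
    return messages

msgs = build_messages("Solve this Sudoku step by step.",
                      "<rendered puzzle>", ["<response 0>"])
```

Because the preamble and the "continue" string never vary, only the rendered puzzle changes between runs, which is exactly what makes this a fixed prompt.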
Candidate solutions will be converted to code and run using a script (https://github.com/Mira-public/manifold-sudoku). You are not required to interact with this script when submitting a solution, but @Mira will attempt to use it to judge your solution, so it may help in understanding the format.
Language modeling capabilities means that GPT-4 is not allowed to use any external tools, plugins, recursive invocations, or resources to aid in solving the Sudoku puzzle. It must rely solely on its language modeling capabilities and the context provided within the prompt. This is less relevant when using the API or Playground, and more relevant to using ChatGPT.
Reliably means the prompt succeeds at least 80% of the time, on freshly-generated puzzles. Occasionally means the prompt succeeds at least 20% of the time, on freshly-generated puzzles. I will run any proposed solution against 5 puzzles, with more testing to be done if it succeeds at least once or if there is disagreement in the comments about whether it meets a threshold (perhaps I got very unlucky). More testing means choosing a fixed pool of puzzles and calculating an exact percentage. I currently plan to choose "all easy-rated Sudoku puzzles in January 2024 from LA Times" as my pool. Since judging solutions requires me spending real money on API calls, I may optionally require collateral to be posted: $10 of mana (Ṁ1000) for quick validation, and $100 of mana (Ṁ10k) for extended validation. Collateral will be posted as a subsidy to an unlisted market that resolves NA if the candidate passes testing, or collected equal to Mira's API costs if not. Anyone can post collateral for a candidate, not just the submitter. Detailed testing will be done with the API set to temperature 0, not ChatGPT.
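The thresholds in the paragraph above reduce to simple arithmetic over a fixed pool of puzzles. A sketch, with illustrative names (actual judging is done by Mira with the script plus manual review):

```python
# Sketch of the resolution thresholds: "reliably" = >= 80% success,
# "occasionally" = >= 20%, measured over a fixed pool of puzzles.
# Illustrative only; not the official judging procedure.

def classify(successes: int, attempts: int) -> str:
    rate = successes / attempts
    if rate >= 0.8:
        return "reliably"      # would support a YES resolution
    if rate >= 0.2:
        return "occasionally"  # would support a 50% resolution
    return "fails"             # counts toward NO
```

So on the preliminary 5-puzzle run, even a single success (20%) triggers extended testing against the full pool.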
@Mira as market creator will trade in this market, but commits not to post any solution, or to provide prompts or detailed prompting techniques to other individuals. So if it resolves YES or 50%, it must be the work of somebody other than Mira.
Example Puzzles
From the New York Times Sudoku page (nytimes.com), March 28, 2023, rated "Easy":
210000487
800302091
905071000
007590610
560003002
401600700
039007000
700100026
100065009
Solution:
213956487
876342591
945871263
327594618
568713942
491628735
639287154
754139826
182465379
Related Markets
Main market: /Mira/will-a-prompt-that-enables-gpt4-to
GPT-4 any puzzle: /Mira/m100-subsidy-will-gpt4-solve-any-fr-c5b090d547d1
GPT-3.5 any puzzle, no finetuning: /Mira/will-gpt35-solve-any-freshlygenerat
GPT-3.5 any puzzle, finetuning allowed: /Mira/will-finetuned-gpt35-solve-any-fres
Group including other related markets: https://manifold.markets/questions?topic=gpt4-sudoku-challenge-2023
Edit History
Mar 26, 2:53pm: Retitled "Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" → "(M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
Mar 27 - Clarified that judging will use freshly-generated puzzles.
Mar 29 - Added example with Chat Completions API to help specify allowed prompts.
Apr 3 - Clarified that dropping Chat Completion API turns is allowed.
Apr 20 - Added a more formal description of the solution format.
Apr 21 - Candidate solutions must be posted in the comments before market close.
Apr 27, 6:43am: Retitled "(M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" → "(M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
Apr 30, 1:57am: Retitled "(M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)" → "(M20000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)".
April 30, 2:57 am: Added that the percentage is defined against a fixed pool of puzzles, if it solves at least one in a preliminary test of 5.
April 30, 5:37 am: Judging will be done with the API. ChatGPT may be accepted if it has a high success rate, but if there's any debate I will use the API with temperature 0. New York Times is chosen as the presumptive source of Sudoku puzzles.
May 5, 2 pm: Link to script on Github, changed puzzle provider to LA Times.
May 7, 3 pm: Details on posting collateral for API costs.
July 16, 7:38 AM: @Mira conflict of interest commitment.
August 8, 2:45 PM: Input representation can be any 81-slot substitution string.
August 15: NO clause for if OpenAI shuts down.
August 23: Truncating a message is allowed.
August 28: You're allowed to make multiple OpenAI API calls to generate a single logical response, to work around limitations of their API.
September 22: Related markets; finetuning and GPT-3.5 aren't allowed.
OpenAI Devday is November 6. It would be a prime opportunity to announce increases in context size, cost savings, a new state management API, or a GPT-4 Instruct model or finetuning support.
If you've been waiting for new model or cost improvements, I would plan on that being your last chance and not waiting beyond that.

Why did this jump so much, and then fall back down again? It jumped quite a while after the gptv thing released, and then came back now, after katja grace sold?

@hmys It was a classic case of "Someone on Twitter said it could solve a Sudoku, nobody had access yet to confirm, people got scared it would be like gpt-3.5-turbo-instruct playing chess well, and nobody bothered to Google the puzzle to see if it was memorized for 6 hours".
On that note, it would've been really smart for YES bettors to check in a Sudoku eval 6 months ago in the OpenAI evals repo. Probably they would've trained it on Sudoku so it scores better on the benchmark.
I actually see 4 different attempts to do that, but the pull requests were all abandoned thinking the others would clean theirs up.
Does gpt4v count?
https://twitter.com/roytomhermann/status/1706861232152621320

@LoganZoellner Yes - it's the multimodal release of GPT-4 which has been anticipated all year. So if it can one-shot solve freshly-generated Sudoku puzzles like that tweet seems to indicate, it will resolve this market YES.

@LoganZoellner Sigh, this sudoku has been on the internet since at least March 4th, 2014. Seems like one of the classic sudokus to test sudoku solvers on


Bonus

Probably true even, except for the implication from my question that the model itself executed the algorithm.
Digit permuted version:

thank you for playing

The new gpt-3.5-turbo-instruct model and finetuning support are getting enough questions that I made two markets allowing those.

Is the use of a fine-tuned model allowed? It’s not clear to me if that would still be “gpt4”

@DanMan314 NO. Model must be an OpenAI model. A list of other features and upcoming variants along with what's allowed is here: https://manifold.markets/Mira/will-a-prompt-that-enables-gpt4-to#w5JMBr0H2Cu8hTPnwYeu
The work on getting the instruct models to play chess seems pretty relevant.
https://news.ycombinator.com/item?id=37558911
https://twitter.com/GrantSlatton/status/1703913578036904431
Chess seems like it's in some ways easier than sudoku (probably more useful stuff in the training material). But that's still an impressive level of chess skill!
@EvanDaniel I agree that it probably benefits from a ton of training data on chess rather than sudoku, but either way my model of what these things can do is pretty volatile 0_0

@EvanDaniel An instruct model seems ideal for the Sudoku challenge, but I excluded 3.5 from the rules in this market months ago. If they release a "gpt-4-instruct" it would be allowed though.
See the "related markets" in the description, for variants that would allow "gpt-3.5-turbo-instruct" and finetuning.

@Mira what's the logic for allowing gpt4 instruct? seems like it doesn't fit the soul of the market. felt like it's about predicting the power of prompt engineering - now it's about predicting the power of an unreleased model
@colorednoise For me its a case of 'how good at reasoning are these models and how hard is it to invoke good reasoning'.
GPT4-instruct is still roughly in the same model-category as GPT4 presumably, and so serves as a more direct illustration of how well it can reason - especially if they made it better and/or had less chatbot-induced issues.
Admittedly I'm less interested in specifically how current GPT-4 reasons and more about how good an LLM like it can reason / be induced to reason.

@colorednoise Instruct models are no more or less powerful than other models in the same class. They just have the "chat" fluff cut out so it follows instructions more reliably. If they had released an Instruct variant along with GPT-4, I probably would've settled on that for this contest.
As it is, you're already predicting unreleased models since multimodal, larger context sizes, possibly the state management API, would all be allowed. But they're all the same model class, and should have similar reasoning capability.

@Mira I don't agree they are no more powerful. Finetuning at the end of the day is just more training (in the case where they don't freeze weights, which we don't know either way, but is definitely possible). And more training definitely creates stronger models - we know the scaling laws.
And empirically "similar reasoning capability" is a matter of definition, if we define reasoning as ability to solve soduko, and instruct solves it while the regular does not, then instruct has better reasoning ability.

@colorednoise In any case, a hypothetical "gpt-4-instruct" would be allowed. It would even be allowed for OpenAI to train it on synthetic Sudoku-solving examples, while disallowed for any of us to do the same.
In my experience, GPT fails miserably at these sorts of spatial reasoning tasks, including ascii art and chess. Would be interesting to see me proven wrong on this, though

@Dvorakgigachad agreed, crosswords and word-search puzzles are more things that it just can’t do. I’d love to be proven otherwise, but I’ve spent my $10 in API credits giving this a shot without enough success.

@AaronBreckenridge I mean I got it to work with 8k context size, just >50 API calls: https://strange-prompts.ghost.io/i-taught-gpt-4-to-solve-sudoku/


@Dvorakgigachad It would be interesting to try pre-lobotomy base ("completion") models. The problem looks like a good fit.
