(M25000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)
1404
closes Jan 1
41% chance

This market predicts whether GPT-4 will be able to solve "easy" Sudoku puzzles by December 31, 2023. I've tried some prompts, but it mostly pretends to do logical deduction and guesses, ends up in an inconsistency (or doesn't notice), and eventually gets stuck in a loop making mistakes. Consider this market a challenge. Also see the group for this and related markets (GPT-4 Sudoku Challenge (2023) | Manifold Markets).

Resolution Criteria

Resolves YES if:

  • A fixed prompt is found (and posted in the comments) that enables GPT-4 to reliably solve freshly-generated easy-rated Sudoku puzzles from Sudoku - Free daily Sudoku games from the Los Angeles Times (latimes.com), using only its language modeling capabilities and context as memory.

Resolves 50% if:

  • A fixed prompt is found (and posted in the comments) that enables GPT-4 to occasionally solve Sudoku puzzles.

Resolves NO if:

  • No fixed prompt that enables GPT-4 to even occasionally solve easy-rated Sudoku puzzles using the specified conditions is posted in the comments by December 31, 2023.

  • OpenAI permanently shuts down GPT-4 access before any solutions are posted in the comments.

Resolves as NA if:

  • This market does not resolve NA.

Definitions

  • GPT-4 refers to either ChatGPT's GPT-4, or any model using the OpenAI Chat Completions API. "gpt-4" and "gpt-4-32k" are currently-known model ids, but anything labeled GPT-4 would count, including the upcoming image support. The API is preferable since setting temperature to 0 will allow the judge to replicate your responses, but if your prompt has a high success rate ChatGPT could also be accepted. See the definitions of "reliably" and "occasionally" below for details on computing the success rate if more precision is needed. The model must be released by OpenAI, so finetuned variants would not count.

  • See "Related markets" below for variants that allow GPT-3.5, finetuned models, and that only need to solve a single puzzle.

  • Easy-rated Sudoku puzzle means a puzzle classified as easy by any reputable Sudoku site or puzzle generator. This market plans to use the LA Times (Sudoku - Free daily Sudoku games from the Los Angeles Times (latimes.com)) for judging, but I maintain the option to use a different Sudoku generator.

  • Fixed-prompt means that everything except the Sudoku puzzle provided to GPT-4 remains the same. The prompt may provide GPT-4 with instructions, but these instructions must not change from puzzle to puzzle. A solution must be found within 50 turns. Use of multimodal support is allowed. The operator cannot give information to GPT-4 beyond the initial puzzle, so their inputs must be static (e.g. just saying "continue" if ChatGPT runs out of output space and stops).

Formal definition of Solution

  • A Sudoku Template is any string with exactly 81 substitution points. Such a template can be combined with 81 values (digits 1-9 or a Placeholder) to produce a Rendered Sudoku. The placeholder can be any string - including "0", ".", or "_" - but must be a specific string and identical each time. The substitution points do not need to be in any specific order: an inverted or flipped puzzle would also be allowed by using a template with substitutions in inverted or flipped order.

    • An image rendering of the initial puzzle would also be a valid Rendered Sudoku.

  • A Chat Completion API entry is a pair (tag, message), where tag is one of "system", "user", or "assistant", and message is any UTF-8 string. When multimodal GPT-4 is released, message can also be an image.

  • A Turn is a pair (entries, response), where entries is a list of Chat Completion API entries and response is the UTF-8 encoded string that GPT-4 generates.

  • A Transition Rule maps one list of entries to another list of entries, using the primitive operations below (a rough code sketch of these primitives follows the Solution definition):

    • Remove the entry at a fixed index (from beginning or end).

    • Insert a fixed message at a fixed index (from beginning or end).

    • Insert a rendered Sudoku created from the initial Sudoku puzzle at a fixed index (from beginning or end). The fixed prompt is allowed to contain multiple renderings of the same puzzle.

    • Insert the GPT-4 response to the input entry list at a fixed index (from beginning or end). You can either use the default GPT-4 response length (i.e. whenever it emits an <|im_end|> token), or specify an exact token count up to the native context size of the model. It is allowed to make multiple API requests, and to retry requests that respond with errors, as long as the successful requests are all unconditionally concatenated into a single response and the inputs + response fit within the model's context. You cannot apply any other transition rules until the entire response is generated.

      • Example: You have 2,000 tokens of input and are using the 32k model. If you specify "32,000" as your size here, you're allowed to keep querying the API, sending the entire context plus all previous responses, until you get exactly 30,000 tokens of output. These should all be concatenated into a single entry.

    • Truncate an entry at a fixed token index (the index is from beginning or end, and truncation can start from beginning or end). You can use characters for testing, but judging will use "cl100k_base" tokens.

  • A Fixed-prompt is any sequence of transition rules.

  • The Operator is the human or program that is executing a fixed-prompt against the OpenAI API.

  • Then a Solution for the purposes of this market is a fixed-prompt satisfying all of:

    • "initial Sudoku puzzle" is bound to a specific rendered Sudoku.

    • The transition rules are applied for 50 turns to get a maximum of 50 GPT-4 responses.

    • The operator, scanning those responses for the first thing that subjectively looks like a solved Sudoku puzzle and then stopping, can input that candidate into a Sudoku checking tool and confirm that it is a solution to the initial Sudoku puzzle.
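
For concreteness, the primitive transition-rule operations above could be mocked up roughly as follows. This is only an illustrative sketch, not the official judging script linked below; the (role, message) tuple representation and function names are assumptions, and the "insert the GPT-4 response" primitive is sketched separately after the Examples section. The cl100k_base truncation uses the tiktoken package.

    # Hedged sketch of the transition-rule primitives. Each rule edits a list of
    # (role, message) entries; names and structure here are illustrative only.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used for judging truncation

    def remove_entry(entries, index):
        # Remove the entry at a fixed index (negative indices count from the end).
        out = list(entries)
        del out[index]
        return out

    def insert_fixed(entries, index, role, message):
        # Insert a fixed message at a fixed index (from beginning or end).
        out = list(entries)
        out.insert(index, (role, message))
        return out

    def insert_rendered_puzzle(entries, index, rendered_sudoku):
        # Insert a rendering of the *initial* puzzle; multiple renderings are allowed.
        return insert_fixed(entries, index, "user", rendered_sudoku)

    def truncate_entry(entries, index, max_tokens):
        # Truncate one entry to a fixed number of cl100k_base tokens (from the start here).
        role, message = entries[index]
        out = list(entries)
        out[index] = (role, enc.decode(enc.encode(message)[:max_tokens]))
        return out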

Examples

The simplest valid pattern is:

  1. ("User", <some initial prompt>)

  2. ("User", <provide puzzle>)

  3. ("Assistant", response 0)

  4. ("User", "continue") ;; or any other fixed input

  5. ("Assistant", response 1)

  6. ("User", "continue")

  7. ....

  8. ("User", "continue")

  9. ("Assistant", solution)

With at most 50 "Assistant" entries (50 turns). The only "dynamic" input here is entry #2, which contains the puzzle; the rest is ChatGPT's responses. So this counts as a "fixed prompt" solution. You're allowed to insert more prompts into the chain after the puzzle, as long as neither the decision to include them nor their contents depends on the puzzle. For example, you might have a prompt that causes ChatGPT to expand the puzzle into a set of logical constraints. You're allowed to drop sections from the chain when sending context to GPT-4, as long as the decision to drop does not depend on the contents of any section.
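
As a rough illustration (not the official judging script, which is linked below), the simplest pattern above might be driven through the Chat Completions API with a loop like the following. It assumes the openai Python package's newer client interface; the instruction text, model id, and looks_solved helper are placeholders rather than anything required by the market.

    # Minimal sketch of the simplest fixed-prompt pattern: one fixed instruction,
    # the rendered puzzle, then up to 50 assistant turns separated by a static "continue".
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    FIXED_INSTRUCTIONS = "Solve the following Sudoku step by step."  # must not vary per puzzle
    MAX_TURNS = 50

    def run_fixed_prompt(rendered_puzzle, looks_solved):
        messages = [
            {"role": "user", "content": FIXED_INSTRUCTIONS},
            {"role": "user", "content": rendered_puzzle},
        ]
        for _ in range(MAX_TURNS):
            resp = client.chat.completions.create(
                model="gpt-4", temperature=0, messages=messages
            )
            reply = resp.choices[0].message.content
            messages.append({"role": "assistant", "content": reply})
            if looks_solved(reply):  # the operator's subjective scan for a filled grid
                return reply
            messages.append({"role": "user", "content": "continue"})  # static input only
        return None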

Candidate solutions will be converted to code and run using a script (Mira-public/manifold-sudoku (github.com)). You are not required to interact with this script when submitting a solution, but @Mira will attempt to use it to judge your solution, so it may help in understanding the format.

  • Language modeling capabilities means that GPT-4 is not allowed to use any external tools, plugins, recursive invocations, or resources to aid in solving the Sudoku puzzle. It must rely solely on its language modeling capabilities and the context provided within the prompt. This is less relevant when using the API or Playground, and more relevant when using ChatGPT.

  • Reliably means the prompt succeeds at least 80% of the time on freshly-generated puzzles. Occasionally means the prompt succeeds at least 20% of the time on freshly-generated puzzles. I will run any proposed solution against 5 puzzles, with more testing to be done if it succeeds at least once or if there is disagreement in the comments about whether it meets a threshold (perhaps I got very unlucky). More testing means choosing a fixed pool of puzzles and calculating an exact percentage. I currently plan to choose "all easy-rated Sudoku puzzles in January 2024 from LA Times" as my pool. Since judging solutions requires me spending real money on API calls, I may optionally require collateral to be posted: $10 of mana (Ṁ1000) for quick validation, and $100 of mana (Ṁ10k) for extended validation. Collateral will be posted as a subsidy to an unlisted market that resolves NA if the candidate passes testing, or collected equal to Mira's API costs if not. Anyone can post collateral for a candidate, not just the submitter. Detailed testing will be done with the API set to temperature 0, not ChatGPT. A short sketch of how these thresholds map to outcomes appears after this list.

  • @Mira as market creator will trade in this market, but commits not to post any solution, or to provide prompts or detailed prompting techniques to other individuals. So if it resolves YES or 50%, it must be the work of somebody other than Mira.
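
A minimal sketch, assuming success counts over a pool of fresh puzzles are already tallied, of how the "reliably" / "occasionally" thresholds above map to outcomes (the function name and example numbers are purely illustrative):

    # Hedged sketch of the "reliably" / "occasionally" thresholds described above.
    def classify(successes, attempts):
        rate = successes / attempts
        if rate >= 0.80:
            return "reliably (YES)"
        if rate >= 0.20:
            return "occasionally (50%)"
        return "below threshold (NO)"

    print(classify(2, 5))  # e.g. 2 of 5 fresh puzzles solved -> "occasionally (50%)"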

Example Puzzles

From Sudoku - New York Times Number Puzzles - The New York Times (nytimes.com) on March 28, 2023, rated "Easy":

210000487
800302091
905071000
007590610
560003002
401600700
039007000
700100026
100065009

Solution:

213956487
876342591
945871263
327594618
568713942
491628735
639287154
754139826
182465379
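
To illustrate the "Sudoku checking tool" step from the Solution definition, the puzzle/solution pair above can be verified with a short script like this (a hedged sketch; the "0" placeholder convention and variable names are assumptions, not requirements of the market):

    # Hedged sketch of checking a candidate against the example puzzle above.
    # Assumes 81-character strings with "0" as the placeholder for empty cells.
    puzzle = ("210000487" "800302091" "905071000"
              "007590610" "560003002" "401600700"
              "039007000" "700100026" "100065009")
    candidate = ("213956487" "876342591" "945871263"
                 "327594618" "568713942" "491628735"
                 "639287154" "754139826" "182465379")

    def is_solution(puzzle, candidate):
        # Every given digit must be preserved; placeholder cells may become any digit.
        if any(p not in ("0", c) for p, c in zip(puzzle, candidate)):
            return False
        rows = [candidate[9 * i:9 * (i + 1)] for i in range(9)]
        cols = ["".join(r[j] for r in rows) for j in range(9)]
        boxes = ["".join(rows[3 * bi + i][3 * bj + j]
                         for i in range(3) for j in range(3))
                 for bi in range(3) for bj in range(3)]
        return all(set(g) == set("123456789") for g in rows + cols + boxes)

    print(is_solution(puzzle, candidate))  # True for the example pair above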

Related Markets

  • (M1000 subsidy) Will GPT-4 solve any freshly-generated Sudoku puzzle? (2023) - variant that only needs a single puzzle solved.

  • Will GPT-3.5 solve any freshly-generated Sudoku puzzle? (/Mira/will-gpt35-solve-any-freshlygenerat) - variant allowing gpt-3.5-turbo-instruct.

  • [M5000 subsidy] Will finetuned GPT-3.5 solve any freshly-generated Sudoku puzzle? (2023) (/Mira/will-finetuned-gpt35-solve-any-fres)

Edit History

  • Mar 26, 2:53pm: Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023) → (M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)

  • Mar 27 - Clarified that judging will use freshly-generated puzzles.

  • Mar 29 - Added example with Chat Completions API to help specify allowed prompts.

  • Apr 3 - Clarified that dropping Chat Completion API turns is allowed.

  • Apr 20 - Added a more formal description of the solution format.

  • Apr 21 - Candidate solutions must be posted in the comments before market close.

  • Apr 27, 6:43am: (M1000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023) → (M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)

  • Apr 30, 1:57am: (M11000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023) → (M20000 subsidy!) Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (2023)

  • April 30, 2:57 am: Added that the percentage is defined against a fixed pool of puzzles, if it solves at least one in a preliminary test of 5.

  • April 30, 5:37 am: Judging will be done with the API. ChatGPT may be accepted if it has a high success rate, but if there's any debate I will use the API with temperature 0. New York Times is chosen as the presumptive source of Sudoku puzzles.

  • May 5, 2 pm: Link to script on Github, changed puzzle provider to LA Times.

  • May 7, 3 pm: Details on posting collateral for API costs.

  • July 16, 7:38 AM: @Mira conflict of interest commitment.

  • August 8, 2:45 PM: Input representation can be any 81-slot substitution string.

  • August 15: NO clause for if OpenAI shuts down.

  • August 23: Truncating a message is allowed.

  • August 28: You're allowed to make multiple OpenAI API calls to generate a single logical response, to work around limitations of their API.

  • September 22: Related markets; finetuning and GPT-3.5 aren't allowed.

Comments
Mira 🍎 predicts YES

OpenAI Devday is November 6. It would be a prime opportunity to announce increases in context size, cost savings, a new state management API, or a GPT-4 Instruct model or finetuning support.

If you've been waiting for new model or cost improvements, I would plan on that being your last chance and not waiting beyond that.

Caleb Ditchfield predicts YES

anyone try using sudolang?

HMYS predicts NO

Why did this jump so much, and then fall back down again? It jumped quite a while after the gptv thing released, and then came back now, after katja grace sold?

Mira 🍎 predicts YES

@hmys It was a classic case of "Someone on Twitter said it could solve a Sudoku, nobody had access yet to confirm, people got scared it would be like gpt-3.5-turbo-instruct playing chess well, and nobody bothered to Google the puzzle to see if it was memorized for 6 hours".

On that note, it would've been really smart for YES bettors to check a Sudoku eval into the OpenAI evals repo 6 months ago. Probably they would've trained it on Sudoku so it scores better on the benchmark.

I actually see 4 different attempts to do that, but the pull requests were all abandoned thinking the others would clean theirs up.

Mira 🍎 predicts YES

@LoganZoellner Yes - it's the multimodal release of GPT-4 which has been anticipated all year. So if it can one-shot solve freshly-generated Sudoku puzzles like that tweet seems to indicate, it will resolve this market YES.

Joshua sold Ṁ1,042 of YES

@LoganZoellner Sigh, this sudoku has been on the internet since at least March 4th, 2014. Seems like one of the classic sudokus to test sudoku solvers on

jskf sold Ṁ190 of YES

Same puzzle, chatgpt 4, full prompt shown

jskf

Bonus

Probably true even, except for the implication from my question that the model itself executed the algorithm.

Digit permuted version:

thank you for playing

Mira 🍎 predicts YES

The new gpt-3.5-turbo-instruct model and finetuning support are getting enough questions that I made two markets allowing those.

/Mira/will-gpt35-solve-any-freshlygenerat

/Mira/will-finetuned-gpt35-solve-any-fres

Benjamin Shindel predicts NO

@Mira if it can beat me at chess it can surely solve a sudoku puzzle

Dan sold Ṁ292 of NO

Is the use of a fine-tuned model allowed? It’s not clear to me if that would still be “gpt4”

Mira 🍎 predicts YES

@DanMan314 NO. Model must be an OpenAI model. A list of other features and upcoming variants along with what's allowed is here: https://manifold.markets/Mira/will-a-prompt-that-enables-gpt4-to#w5JMBr0H2Cu8hTPnwYeu

Evan Daniel predicts YES

The work on getting the instruct models to play chess seems pretty relevant.

https://news.ycombinator.com/item?id=37558911

https://twitter.com/GrantSlatton/status/1703913578036904431

Chess seems like it's in some ways easier than sudoku (probably more useful stuff in the training material). But that's still an impressive level of chess skill!

Tom Offer predicts NO

@EvanDaniel I agree that it probably benefits from a ton of training data on chess rather than sudoku, but either way my model of what these things can do is pretty volatile 0_0

Mira 🍎 predicts YES

@EvanDaniel An instruct model seems ideal for the Sudoku challenge, but I excluded 3.5 from the rules in this market months ago. If they release a "gpt-4-instruct" it would be allowed though.

See the "related markets" in the description, for variants that would allow "gpt-3.5-turbo-instruct" and finetuning.

colorednoise predicts NO

@Mira what's the logic for allowing gpt4 instruct? seems like it doesn't fit the soul of the market. felt like it's about predicting the power of prompt engineering - now it's about predicting the power of an unreleased model

Aleph predicts YES

@colorednoise For me it's a case of 'how good at reasoning are these models, and how hard is it to invoke good reasoning'.

GPT-4-instruct is presumably still roughly in the same model category as GPT-4, and so serves as a more direct illustration of how well it can reason - especially if they made it better and/or it had fewer chatbot-induced issues.

Admittedly I'm less interested in specifically how the current GPT-4 reasons and more in how well an LLM like it can reason / be induced to reason.

Mira 🍎 predicts YES

@colorednoise Instruct models are no more or less powerful than other models in the same class. They just have the "chat" fluff cut out so it follows instructions more reliably. If they had released an Instruct variant along with GPT-4, I probably would've settled on that for this contest.

As it is, you're already predicting unreleased models since multimodal, larger context sizes, possibly the state management API, would all be allowed. But they're all the same model class, and should have similar reasoning capability.

colorednoise predicts NO

@Mira I don't agree they are no more powerful. Finetuning at the end of the day is just more training (in the case where they don't freeze weights, which we don't know either way, but is definitely possible). And more training definitely creates stronger models - we know the scaling laws.

And empirically, "similar reasoning capability" is a matter of definition: if we define reasoning as the ability to solve Sudoku, and instruct solves it while the regular model does not, then instruct has better reasoning ability.

Mira 🍎 predicts YES

@colorednoise In any case, a hypothetical "gpt-4-instruct" would be allowed. It would even be allowed for OpenAI to train it on synthetic Sudoku-solving examples, while disallowed for any of us to do the same.

Wie Dan predicts NO

@Mira That last one is a real risk lol

Dvorak gigachad bought Ṁ250 of NO

In my experience, GPT fails miserably at these sorts of spatial reasoning tasks, including ascii art and chess. Would be interesting to see me proven wrong on this, though

Aaron Breckenridge predicts NO

@Dvorakgigachad Agreed, crosswords and word-search puzzles are more things that it just can't do. I'd love to be proven otherwise, but I've spent my $10 in API credits giving this a shot without enough success.

Zak Miller

@AaronBreckenridge I mean I got it to work with 8k context size, just >50 API calls: https://strange-prompts.ghost.io/i-taught-gpt-4-to-solve-sudoku/

Aaron Breckenridge predicts NO

@zjmiller Neat tips, I’ll give some of these a try!

Alex Kropivny predicts NO

@Dvorakgigachad It would be interesting to try pre-lobotomy base ("completion") models. The problem looks like a good fit.

Dvorak gigachad predicts NO

@zjmiller a very interesting and entertaining approach, indeed