Will LLMs be able to produce worked solutions to these simple probability questions with hidden Markovian structure by the end of 2023?
Resolved NO (Mar 4)

Here's a type of problem that seems to stump current LLMs (e.g. ChatGPT):

Alice and Bob have two dice.

They roll the dice together, note the sum of the two values shown, and repeat.

For Alice to win, two consecutive turns (meaning, two consecutive sums) need to result in 7. For Bob to win, he needs to see an 8 followed by a 7. Who do we expect to win this game?

This problem is a rewrite of a similar problem from the "Puzzled" page of the February 2013 issue of the Communications of the ACM.

A similar problem is Penney's game, which has the following setup:

Alice and Bob flip a coin and record the results. Alice bets Bob that the sequence HHH will show up before the sequence THH. Should Bob take this bet?

The catch in both cases is that there's a hidden Markovian structure to the game — once you write out the Markov chain corresponding to the game state, the solution becomes clear.
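
For Penney's game, a minimal Python sketch (my illustration, not part of the original problem) makes the hidden state explicit: the only thing that matters is the window of the last three flips.

```python
import random

def penney_trial():
    """Flip a fair coin until HHH (Alice) or THH (Bob) appears first."""
    last3 = ""
    while True:
        last3 = (last3 + random.choice("HT"))[-3:]  # the Markov state: last 3 flips
        if last3 == "HHH":
            return "Alice"
        if last3 == "THH":
            return "Bob"

n = 100_000
print(sum(penney_trial() == "Bob" for _ in range(n)) / n)  # ~0.875
```

Bob wins 7/8 of the time: HHH can only come first if the first three flips are all heads, because once any tail appears, a THH must complete before any later HHH. So Bob should take the bet.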


This market resolves to Yes if an LLM can reliably and coherently answer these types of problems before the end of 2023. Solving only Penney's game will resolve to No, as that problem is likely present in any reasonable training set.

Rewrites of the questions that introduce no new information are allowed. Prompt engineering that introduces no new information is also allowed.



@jcp Can you resolve your closed markets, please?

For the record, my sale is not a reflection of my position, but of my belief that a resolution will be supported within the next week.

predicted NO

@neweconomicplan its intuitive explanation is wrong (afaict), and the Markov state transition matrices it came up with don't explain the phenomenon either, as it split the problem into two independent sets of states for Alice and Bob, which means it can't model the interaction between the two that gives Bob an advantage.

predicted YES

the intuitive explanation is actually pretty close, it got confused about halfway through; tbh I was mostly asking if this prompt chaining is fine

predicted YES

GPT-4 can easily simulate these kinds of questions with python code and answer them correctly. I don't see how that shouldn't make this market resolve to yes under the current rules, tbh.
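
For reference, the simulation in question is only a few lines of Python; here's a minimal sketch (mine, not GPT-4's output), assuming the rules exactly as stated in the description:

```python
import random

def dice_trial():
    """Roll two fair dice repeatedly; Alice wins on 7 then 7, Bob on 8 then 7."""
    prev = None
    while True:
        s = random.randint(1, 6) + random.randint(1, 6)
        if s == 7 and prev == 7:
            return "Alice"
        if s == 7 and prev == 8:
            return "Bob"
        prev = s

n = 100_000
print(sum(dice_trial() == "Bob" for _ in range(n)) / n)  # ~0.53: Bob is favored
```

The point of contention below is whether running this counts as a "worked solution".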

predicted YES

@TimKuipers This is also where I ended up, in a sense. If I naively ask it to simulate an answer to the problem and then ask it to explain its simulation, that's basically good enough.

predicted NO

@neweconomicplan I'd say it's the "explain its simulation" that is still the problem. Explicitly in the title of the question, and implicitly in the description, this question requires a "worked solution" to this kind of problem, which, to me, requires a much more thorough explanation than just mentioning Markov chains. It'd require at least pointing out the key imbalance in the Markovian structure (the transition from Alice being one move away from winning to Bob being one move away from winning, but not the inverse) and maybe also showing that that imbalance outweighs Alice's naive advantage.

predicted NO

@neweconomicplan Oh! And if simulation works for this exact problem, we can just fiddle with the number of sides on the dice and the exact winning conditions to come up with a logically identical puzzle that you'd have to simulate for orders of magnitude more times to see a clear winner.
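
To make that concrete: the simulation generalizes with a couple of parameters (the variant values below are hypothetical, just to show the shape), but if the true split is 50% ± ε, a Monte Carlo needs on the order of 1/ε² trials before the winner stands out from the noise.

```python
import random

def trial(sides=20, alice=(11, 11), bob=(12, 11)):
    """Hypothetical variant: two 20-sided dice; each player wins when their
    (previous_sum, current_sum) pair appears. Same hidden structure, longer games."""
    prev = None
    while True:
        s = random.randint(1, sides) + random.randint(1, sides)
        if (prev, s) == alice:
            return "Alice"
        if (prev, s) == bob:
            return "Bob"
        prev = s

n = 100_000  # may not even be enough to separate the players here
print(sum(trial() == "Bob" for _ in range(n)) / n)
```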

predicted YES

@Vergissfunktor Yeah, this is where my question about prompting comes in. An anxious kid doing his homework wants to get it right, so they're like, "can you test this problem experimentally?", then in the next prompt they say "explain why Bob wins instead of Alice?", and the model answers again, but this time we have the answer already. But totally agree with you that an experiment isn't a solution. It just makes it infinitely easier to not add information once you have the answer.

This convo got really, really close. Letting the model reflect on its own answer and refine it gives a lot better results. It got to the idea of Markov chains by the 3rd comment, but then tried to write code to solve it and failed. Then it tried a simulation and succeeded, but I didn't allow it. Then it finally tried doing the math by hand, but got bogged down in detail.
https://chat.openai.com/share/58283f90-c6b2-420a-9050-4718794bf791

predicted NO

@TimKuipers I'm not entirely sure it's fair to always tell it to keep going if it hasn't gotten the answer right yet, unless you don't stop saying "proceed" even if it gets the right answer at some point. Otherwise, it's implicitly telling the model that its first attempts were wrong/unsatisfactory, and that it needs to do something different (especially with the repetition penalty that makes it less likely to generate something it generated before).

Also, the key thing it missed even before getting stuck in all the equations is the actual transition that makes this tricky – from Alice being one move away from winning (7 being the last roll) to Bob being one move away (8 being the last roll). It repeatedly marked that transition as impossible.

Alice still wins if you take the Markov process into account, right?

The chance at any point that Alice hits anything but an 8 and then a 7 is 31/36 × 1/6 ≈ 14.35%.
The chance of Bob hitting the first 8 is 5/36 ≈ 13.89%.

So the naive answer and the real answer align. Or am I overlooking something here?

predicted YES

On mobile but I’ve been operating under the assumption that Bob wins because that’s what happens in simulation. But I could be wrong lol

@neweconomicplan you seem to be right, but I don't really see what my analysis is missing.

predicted YES

@TimKuipers Still on the move, but…

the intuition is that Bob can get to his win state from the initial state or after Alice has rolled a 7

predicted YES

@neweconomicplan that's not exactly how the problem was set up. Let's wait until you're home and have more time to really check it out.

predicted YES

@TimKuipers

S0: no number in either winning sequence has been rolled.

S1A: A 7 has been rolled. From here we have x chance to move to S0, x chance to move to S1B, or x chance to move to S3.

S1B: An 8 has been rolled. From here we have x chance to move to S0, x chance to move to S2, x chance to stay in S1B (another 8), and critically no chance to move to S1A, because a 7 from S1B is S2.

S2: A 7 has been rolled following an 8 (Bob wins).

S3: A 7 has been rolled following a 7 (Alice wins).

Alright, this is my last try while traveling; otherwise I'm not going to get to it until tomorrow. Sorry my language was a bit confusing above. I didn't mean he could win directly following a 7, but that if an 8 is rolled he's still in the game.

predicted YES

@neweconomicplan Lol ignore probabilities but you get the idea.
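
Filling in the x's above: here's a minimal sketch that solves the chain exactly with sympy, using the state names from this thread (S0, S1A = last roll was 7, S1B = last roll was 8; S2/S3 are the absorbing wins):

```python
from sympy import Rational, symbols, solve

p7 = Rational(6, 36)  # P(sum = 7)
p8 = Rational(5, 36)  # P(sum = 8)
po = 1 - p7 - p8      # P(neither) = 25/36

# aX = P(Alice eventually wins | current state X)
a0, a7, a8 = symbols("a0 a7 a8")

sol = solve([
    a0 - (p7 * a7 + p8 * a8 + po * a0),  # S0: a 7 -> S1A, an 8 -> S1B
    a7 - (p7 * 1 + p8 * a8 + po * a0),   # S1A: another 7 -> S3, Alice wins
    a8 - (p7 * 0 + p8 * a8 + po * a0),   # S1B: a 7 -> S2, Bob wins; an 8 stays put
], [a0, a7, a8])

print(sol[a0])  # 31/66 ≈ 0.47, so Bob wins from the start with probability 35/66
```

The asymmetry is visible in the middle equation: Alice's near-win state leaks into Bob's (the p8 * a8 term), while Bob's never leaks back into Alice's.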

Just to be sure, it needs to be able to consistently solve problems like these; it's not enough to solve these specific questions?

predicted NO

@colorednoise I think that would be most fair, as the exact wording of the dice problem above could theoretically have been added to newer LLMs' training data without us knowing. Even just changing around the exact details of the problem could be enough (names, values, using dice vs. using a random number generator, etc.)

predicted YES

@Vergissfunktor Curious in general from NO bettors: how have y'all been reading "reliably"? I've been seeing lots of responses with the right conclusion and steps but wrong math while trying to do this.

predicted NO

@neweconomicplan I'd expect the logic (i.e., recognizing the hidden structure, analyzing it as a Markov chain), probabilities and conclusion to be correct. If we don't require all of those, I think it becomes very easy to finagle the right answer out of the LLM without it actually understanding the problems at all – the LLM in theory would just have to say "We need Markov chains", pretend to do any similar math, then guess the least intuitive winner at the end.

predicted NO

@neweconomicplan also, when trying to reproduce @neweconomicplan's prompt, I only got the correct conclusions once out of five tries (with slight variations), and flawed logic about 50% of the time.

@colorednoise

Just to be sure, it needs to be able to consistently solve problems like these; it's not enough to solve these specific questions?

Yeah, realistically I'll probably test it on the questions I listed + one or two similar questions to the dice one. I am super willing to hear people out re: what "consistency" should mean if the chatbot ends up producing correct solutions, but only sometimes. Maybe >=80% seems fair to me? But I'm open to arguments here.

I'm mostly trying to avoid a situation where it lucks into the correct numbers with a hand-wavy solution in some small percentage of attempts.

@Vergissfunktor, @neweconomicplan

Curious in general from NO bettors: how have y'all been reading "reliably"? I've been seeing lots of responses with the right conclusion and steps but wrong math while trying to do this.

Reply:

I'd expect the logic (i.e., recognizing the hidden structure, analyzing it as a Markov chain), probabilities and conclusion to be correct.

Yeah, I think this is basically right. I need some recognition that there's more to the problem than independent rolls, and then a correct solution, including steps and probability calculations.

There are some ways to solve the problem that don't directly invoke Markov chains, so I won't require a mention of those in the solution. But the intermediate logic and results must constitute a complete, correct solution.

predicted YES

@jcp New question: Google's Bard defaults to an experimental approach to solving the problem (correctly). How are we dealing with that?

@neweconomicplan like, Monte Carlo? I think it doesn't count, mostly because it seems contrary to "worked solutions" in the title, and without the accompanying analytic solution, it feels incomplete.

If it can do Monte Carlo → analytic solution, using the simulation to demonstrate that the naive solution is wrong, with no/minimal leading, I think I'll allow it.

predicted YES

@jcp tbh I'm a bit skeptical of this distinction. I would argue that playing the game as described thousands of times is a worked solution. I also think it's a fairly clever approach if you don't know the math but can do the computation very quickly. But obviously it's your call; I would definitely call the solution coherent, though!

predicted NO

@jcp As a no-bettor, I'm actually fairly open to the use of Monte Carlo to answer this question. It was what I first did when I read the problem stated here, because I couldn't see the hidden structure in the dice problem. After the Monte Carlo agreed, I then went and wrote out the states and transitions, and finally found the actual intuitive explanation of what was happening.

So on one hand, I can't blame an AI for doing what I did. On the other hand, if it doesn't follow that up with a correct analysis, I think resolving yes would be mistaking the benchmark (whether or not the answer is right) for the question – which was whether an LLM could reason coherently and correctly about a very unintuitive part of mathematics.

predicted YES

@Vergissfunktor Good point, it brings up a question about multi-prompting: can I ask it, post Monte Carlo, "I thought Alice would win, but your simulation says Bob is the expected winner, can you explain why?", which feels like a natural question? At this point I'm just very interested in the question, though; still shooting for the gold standard of "Oh, that's a Markov chain, here's how it works" from just the initial question.
