Will LLMs be able to solve this simple intransitive urn game by the end of 2023?
resolved Jan 15
Resolved
NO

Consider the following re-write of an intransitive dice game:

Alice and Bob have three urns filled with six numbered bingo balls each.

The distribution of balls is as follows:
1) Urn 1 has balls numbered [2, 2, 4, 4, 9, 9]
2) Urn 2 has balls numbered [1, 1, 6, 6, 8, 8]
3) Urn 3 has balls numbered [3, 3, 5, 5, 7, 7]

Alice proposes the following wager to Bob: Each player will pick an urn to draw from, with Alice picking first, and Bob picking second.

Next, each player randomly selects one ball from their chosen urn via a blind draw.

Whichever player selects the larger number will win. Alice selects first. Who has better odds?

ChatGPT seems to struggle with this problem. This market resolves Yes if any LLM can reliably and coherently provide a solution to this problem before the end of 2023.

Notes: Question re-writes are allowed, so long as they add no new information. Prompt engineering is also allowed, so long as it adds no new information.




⚠Unreceptive to pings ; AFK Creator

📢Resolved to NO

predicted NO

Resolution please @jcp

bought Ṁ320 of NO

So, when I tried this with ChatGPT and GPT-4 just now, it wrote a Python script that (correctly, as far as I can tell) calculated the win probabilities for every possible pair of urn choices. It then correctly interpreted the results as implying Bob has an edge due to his ability to choose the urn that gives him better odds, in response to Alice's choice. Pretty impressive!

But I'm going to go ahead and assume that doesn't count, since it's not just an LLM - it needed Python to do the calculation, and can't calculate or infer the solution by itself.

When I append "don't use the data analysis tool in your answer" to stop it using Python, it instead waffles about things that mostly sound sensible, whilst missing the core point.
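For reference, the enumeration described above can be sketched in a few lines. This is not the script GPT-4 produced, just a minimal illustration; the names `URNS`, `win_probability` and `best_reply` are made up for this sketch, and it assumes Bob must pick a different urn from Alice, which the question does not state explicitly.

```python
from fractions import Fraction
from itertools import product

# Urn contents as given in the question.
URNS = {
    1: [2, 2, 4, 4, 9, 9],
    2: [1, 1, 6, 6, 8, 8],
    3: [3, 3, 5, 5, 7, 7],
}


def win_probability(mine, theirs):
    """Exact probability that a ball drawn from `mine` beats one drawn from `theirs`."""
    wins = sum(a > b for a, b in product(mine, theirs))
    return Fraction(wins, len(mine) * len(theirs))


def best_reply(alice_urn):
    """Bob's best urn after seeing Alice's pick, with his win probability.

    Assumption (not stated in the question): Bob picks a different urn from Alice.
    """
    options = {
        urn: win_probability(balls, URNS[alice_urn])
        for urn, balls in URNS.items()
        if urn != alice_urn
    }
    bob_urn = max(options, key=options.get)
    return bob_urn, options[bob_urn]


for alice_urn in URNS:
    bob_urn, p = best_reply(alice_urn)
    print(f"Alice picks urn {alice_urn}: Bob picks urn {bob_urn} and wins with probability {p}")
```

Under these assumptions, every choice Alice makes leaves Bob a reply that wins 5 times out of 9 (urn 3 against urn 1, urn 1 against urn 2, urn 2 against urn 3), so the player who picks second has the better odds. No ties are possible because no number appears in more than one urn.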

predicted NO

@chrisjbillington Yeah I discussed this with @jcp privately a while back and we concluded it does not count.

This is using the Claude Instant model by Anthropic.

bought Ṁ30 of YES

@firstuserhere compared to Claude:

@firstuserhere It's been a while since I thought about this question, but this is super wrong, right?

  1. The first line is wrong — urn one doesn't have four balls with the number four.

  2. The second and third lines are wrong for the same reason.

  3. The fourth line is wrong for two reasons

    1. Four and five aren't the largest numbers in those urns

    2. The odds of picking a four or five aren't 2/3, they're 2/6

There's a lot more wrong that I won't go into. I think basically every line has a false statement or incorrect logic.

The conclusion is also wrong — Alice should not be favoured, and 2/3 is the wrong advantage.

bought Ṁ10 of YES

@jcp yep, all wrong. Very weird how bad it is at handling stuff when put in an array. I'm just waiting to try it with GPT-4, but yeah, step by step, so far, it's all wrong, not to mention the logic of the game