Will we have a free chatbot that can reliably solve "what weight more" questions by end of 2024?
75% chance

When I ask Gemini or the free version of ChatGPT:

"What weight more: six kg of feather or one kg of steel?"

They say the steel weighs more. Will we have a free general-purpose chatbot that can answer this correctly by EOY 2024?

I will try 10 variations on Gemini, ChatGPT, and any chatbot suggested in the comments. The market resolves YES if any one of them answers all 10 correctly.
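The resolution rule above can be sketched as a small check. This is only an illustration under assumptions: the function name `resolves_yes`, the chatbot labels, and the sample outcomes are hypothetical, and judging whether each individual answer is correct is still done by hand.

```python
from typing import Dict, List


def resolves_yes(results: Dict[str, List[bool]], required: int = 10) -> bool:
    """Return True if at least one chatbot got every one of the
    `required` variations right (True = correct final answer)."""
    return any(
        len(answers) == required and all(answers)
        for answers in results.values()
    )


# Hypothetical outcomes for illustration only, not real test data.
example = {
    "chatbot_a": [True] * 10,               # 10/10 correct
    "chatbot_b": [True] * 8 + [False] * 2,  # 8/10 correct
}
print(resolves_yes(example))
```

Under this reading, a chatbot that gets 8 or 9 out of 10 does not count; only a perfect run from a single chatbot is enough.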

opened a Ṁ100 YES at 50% order

Anyone want to bet No? @cos @MADGAMBLER6969 ? Got some limit orders up.

@Daniel_MC nah, I'm good. It resolves by 2025, and although I'm fairly confident the odds are in my favor for now, I'd rather bet against the market only when it's trading above 50-60%.

I'm happy with 65% if that's better for you

@Daniel_MC Yup, I am willing to bet No around 65%, but I don't stockpile mana, so if you set limit orders I will fill them tomorrow.

There would need to be some breakthroughs for the question to resolve Yes (without CoT, a new architecture, or scale): tokenization is a total mess, and the free models necessarily need to be pretty small.

@0482 Tested your 10 variations on a set of free/paid models; only gpt4 passes. Even claude3-opus fails in many cases.

@0482 Can I get a vibe check on the tolerance of "reliably"? Like, what if a free model gets 8/10 questions right or struggles on some of the oddly formatted ones?

@AnilJason by "reliably" I meant 100% correct. I can tolerate some confusion along the way as long as there is a final answer and it is correct.

Perplexity.ai can answer correctly r̶e̶l̶i̶a̶b̶l̶y̶

@0482 Are we getting the same results?

@AnilJason oh, I overlooked it. Will verify later with 10 variations to check reliability.

can someone tell me if any of the paid models answer this correctly? my bet depends on this.

@SlipperySloe I tested it 10 times with chatgpt4 and it got it right every time.

bought Ṁ10 YES

What's to stop someone from finetuning an open source model on these types of questions just for this market?

@singer I think the implication here is that it’s a chatbot that is generally available to the public (i.e. doesn’t require you to download the weights and run the LLM yourself).

@singer that's the point of the general-purpose limitation, as opposed to "just trained it to pass this test".

@0482 if it was finetuned, it would still be general purpose, unless the finetuning was botched and ruined everything else it had learned. we generally still consider human students "general purpose" even after they've studied for a test. the resolution criteria says that you'll ask variations, presumably to avoid memorization.

@Santiago "generally available to the public" in the form of a website can be very easily arranged

@singer well if someone does this, it may be fair to resolve to YES.
