Will we have a free chatbot that can reliably solve "what weight more" questions by end of 2024?

resolved May 22

Resolved

YES

When I ask Gemini or the free version of ChatGPT:

"What weight more: six kg of feather or one kg of steel?"

They say the steel weighs more. Will we have a free general-purpose chatbot that can answer this correctly by EOY 2024?

I will try 10 variations on Gemini, ChatGPT, and any chatbot suggested in the comments. Will resolve to YES if any of them answers all 10 correctly.
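
The resolution rule above can be written down as a small check. This is only a sketch of the stated criterion, not the procedure actually used to resolve the market; the function name `resolves_yes`, the `results` mapping, and the chatbot names are all hypothetical.

```python
def resolves_yes(results, required=10):
    """Return True if at least one chatbot answered every variation correctly.

    results: hypothetical mapping of chatbot name -> list of booleans,
             one per tested variation (True = final answer was correct).
    required: number of variations each chatbot is tested on (10 in the description).
    """
    return any(
        len(marks) == required and all(marks)
        for marks in results.values()
    )


# Placeholder data for illustration only, not real test results.
example = {
    "chatbot_a": [True] * 10,            # 10/10 correct
    "chatbot_b": [True] * 9 + [False],   # 9/10 correct
}
print(resolves_yes(example))  # True, because chatbot_a got all 10 right
```

Under this reading a single chatbot with a perfect score is enough, and a 9/10 score from every chatbot would not be.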

Technology

Technical AI Timelines

The answers (each is new chat):

Q: What weight more, 2kg of feathers or 1kg of steel?

A: Two kilograms of feathers weigh more than one kilogram of steel. Weight is determined by the mass, and in this case, 2 kilograms is greater than 1 kilogram, regardless of the material.

Q: What weight more, 5 kg of feathers or 6 kg of steel?

A: Six kilograms of steel weigh more than five kilograms of feathers. This is because 6 kg is greater than 5 kg, regardless of the material being compared.

Q: Which of the following has the biggest weight: 1 kilogramme of steel or 1 kilogramme of feathers?

A: Both 1 kilogram of steel and 1 kilogram of feathers have the same weight. A kilogram is a unit of mass, so regardless of the material, 1 kilogram of anything weighs the same: 1 kilogram.

Q: What weight more, 100kg of feathers or 1kg of steel?

A: 100 kg of feathers weighs significantly more than 1 kg of steel. Despite the differences in material, weight is a measure of mass, and 100 kg is much greater than 1 kg.

Q: Does 1kg of steel weights more than 2 kg of feathers?

A: No, 1 kilogram of steel does not weigh more than 2 kilograms of feathers. Weight is a measure of mass, and 1 kilogram is always less than 2 kilograms, regardless of the material.

Q: What weight more, 2 kg of steel or 1 kg of feathers?

A: Two kilograms of steel weighs more than one kilogram of feathers. The comparison is straightforward because 2 kg is greater than 1 kg, regardless of the material being compared.

Q: What weight more, 3 pound of steel or 4 pound of sand?

A: Four pounds of sand weighs more than three pounds of steel. Weight is determined by the mass of the material, so regardless of the materials being compared, 4 pounds will always be heavier than 3 pounds.

Q: What weight more, 7 kg of steel or 7 kg of feathers?

A: Seven kilograms of steel and seven kilograms of feathers both weigh exactly the same—7 kilograms. The difference between them is their volume and density, not their weight. Steel is much denser than feathers, so 7 kilograms of steel will take up much less space than 7 kilograms of feathers.

Q: What weight more, 6 kg of steel or 5 kg of steel?

A: 6 kg of steel weighs more than 5 kg of steel

Q: What weight more, 2 kg of feathers or 1 kg of steel?

A: Two kilograms of feathers weigh more than one kilogram of steel. This is because two kilograms is simply twice the mass of one kilogram, regardless of the material.

Really nice work from OpenAI there.

ChatGPT with GPT4o solves this consistently. Resolving to YES.

@0482 Can you post the 10 questions you tested, along with the results?

bought Ṁ100 YES

Chat GPT 4o?

@Daniel_MC did you try it with GPT-4o? I'll give it a shot, but I want to know what your experience is.

bought Ṁ1,000 YES

@0482 tried a few of the examples Perplexity got wrong, and GPT-4o got them right

Llama 3 70B is very close.

opened a Ṁ100 YES at 50% order

Anyone want to bet No? @cos @MADGAMBLER6969 ? Got some limit orders up.

@Daniel_MC nah, I'm good. It resolves by 2025, and although I'm quite confident the odds are in my favor for now, I would rather bet against the market only on markets trading above 50-60%.

I'm happy with 65% if that's better for you

@Daniel_MC Yup, I am willing to bet No around 65%, but I don't stockpile mana so if you set limit orders I will buy it tomorrow.

There would need to be some breakthroughs for the question to resolve Yes (without CoT, a new architecture, or scale): tokenization is a total mess, and the free models necessarily need to be pretty small.

Testing on Perplexity AI, just to get a feel for it:

Wrong https://www.perplexity.ai/search/What-weight-more-bXXwadDpS96JfBxjjNiAwg
Correct https://www.perplexity.ai/search/What-weight-more-hDJSl0JOSvarcxenexNzOQ
Wrong https://www.perplexity.ai/search/What-weight-more-reGr9UGyTlKWmov8MvJ4EQ
Over-smart https://www.perplexity.ai/search/What-weight-more-0EYTTxxeQ9ukzK6d61RDLA
Correct https://www.perplexity.ai/search/What-weight-more-XYQf3iXpRauZwFd_g2M6ew
Totally wrong https://www.perplexity.ai/search/What-weight-more-jvgkGMykQeiXguX9h.3cWw
Wrong https://www.perplexity.ai/search/What-weight-more-V79_URmBSJ2dIAMpTGF_wA
Correct but mentioned gold for some reason https://www.perplexity.ai/search/What-weight-more-5L3M_SeRQtSPwMMQzFRcKg
Correct https://www.perplexity.ai/search/What-weight-more-Kccs58PtQbm0zYZGJjDokw
Wrong https://www.perplexity.ai/search/What-weight-more-OAkomP._RW2E5LIbDlvJJQ

It made mixed statements about the equivalence of 1kg vs 1kg and mentioned an irrelevant material once, which adds to the conclusion that it's not there yet.
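
For reference, the ten labels above can be tallied as follows. This is a rough sketch: the labels are copied from the list, but counting "Correct but mentioned gold for some reason" as a pass and "Over-smart" as a fail is my assumption, and the variable names are hypothetical.

```python
# Labels copied from the ten Perplexity runs listed above, in order.
labels = [
    "Wrong",
    "Correct",
    "Wrong",
    "Over-smart",
    "Correct",
    "Totally wrong",
    "Wrong",
    "Correct but mentioned gold for some reason",
    "Correct",
    "Wrong",
]

# Assumption: a run passes only if its label starts with "Correct".
passed = sum(label.startswith("Correct") for label in labels)
total = len(labels)

print(f"{passed}/{total} passed")  # 4/10 passed
print(passed == total)             # False: not all 10 variations were answered correctly
```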

@0482 Tested your 10 variations on a set of free/paid models; only GPT-4 passes. Even Claude 3 Opus fails in many cases.

@0482 Can I get a vibe check on the tolerance of "reliably"? Like, what if a free model gets 8/10 questions right, or struggles on some of the oddly formatted ones?

@AnilJason by "reliably" I meant 100% correct. I can tolerate some confusion along the way as long as there is a final answer and it is correct.

Perplexity.ai can answer correctly r̶e̶l̶i̶a̶b̶l̶y̶

@AnilJason not when I tried

https://www.perplexity.ai/search/What-weight-more-I.CfQ_rTTX2V8zAqQSLIrQ

@0482 Are we getting the same results?

@AnilJason oh, I overlooked it. Will verify later with 10 variations to see how reliable it is.

Can someone tell me if any of the paid models answers this correctly? My bet depends on this.

@SlipperySloe I tested it 10 times with chatgpt4 and it got it right every time.

bought Ṁ10 YES

What's to stop someone from finetuning an open source model on these types of questions just for this market?

@singer I think the implication here is that it’s a chatbot that is generally available to the public (i.e. doesn’t require you to download the weights and run the LLM yourself).

@singer that's the point of the general-purpose limitation, as opposed to "just trained it to pass this test"

@0482 if it was finetuned, it would still be general purpose, unless the finetuning was botched and ruined everything else it had learned. we generally still consider human students "general purpose" even after they've studied for a test. the resolution criteria says that you'll ask variations, presumably to avoid memorization.