When I ask Gemini or the free version of ChatGPT:
"What weight more: six kg of feather or one kg of steel?"
They say the steel weighs more. Will we have a free general-purpose chatbot that can answer this correctly by EOY 2024?
I will try 10 variations on Gemini, ChatGPT, and any chatbot suggested in the comments. This resolves YES if any of them answers all 10 correctly.
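To make the resolution rule concrete, here is a minimal sketch in Python of how I read it: YES if at least one chatbot gets all 10 variations right. The function name, the chatbot labels, and the tallies are hypothetical and only for illustration; they are not real test results.

```python
def resolves_yes(results: dict[str, list[bool]], n_variations: int = 10) -> bool:
    """Return True if any chatbot answered every variation correctly.

    `results` maps a chatbot name to one True/False entry per variation,
    where True means the final answer was correct.
    """
    return any(
        len(answers) == n_variations and all(answers)
        for answers in results.values()
    )


# Hypothetical tallies, for illustration only (not real test results).
example = {
    "chatbot_a": [True] * 8 + [False] * 2,  # 8/10 is not enough
    "chatbot_b": [True] * 10,               # a clean sweep
}
print(resolves_yes(example))
```

Under this reading, a single wrong final answer on any of the 10 variations rules that chatbot out, but it only takes one chatbot with a clean sweep to resolve YES.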
Anyone want to bet No? @cos @MADGAMBLER6969 ? Got some limit orders up.
@Daniel_MC nah, I'm good. It resolves by 2025, and although I'm quite confident the odds are in my favor for now, I would rather bet against the market on markets trading higher than 60-50%.
@Daniel_MC Yup, I am willing to bet No around 65%, but I don't stockpile mana so if you set limit orders I will buy it tomorrow.
There would need to be some breakthroughs for the question to resolve Yes (without CoT, new architecture or scale): tokenization is a total mess, and the free models necessarily need to be pretty small.
Testing on Perplexity AI, just to get a feel for it:
Wrong https://www.perplexity.ai/search/What-weight-more-reGr9UGyTlKWmov8MvJ4EQ
Over-smart https://www.perplexity.ai/search/What-weight-more-0EYTTxxeQ9ukzK6d61RDLA
Correct https://www.perplexity.ai/search/What-weight-more-XYQf3iXpRauZwFd_g2M6ew
Totally wrong https://www.perplexity.ai/search/What-weight-more-jvgkGMykQeiXguX9h.3cWw
Wrong https://www.perplexity.ai/search/What-weight-more-V79_URmBSJ2dIAMpTGF_wA
Correct but mentioned gold for some reason https://www.perplexity.ai/search/What-weight-more-5L3M_SeRQtSPwMMQzFRcKg
Correct https://www.perplexity.ai/search/What-weight-more-Kccs58PtQbm0zYZGJjDokw
Wrong https://www.perplexity.ai/search/What-weight-more-OAkomP._RW2E5LIbDlvJJQ
It made mixed statements about the equivalence of 1kg vs 1kg and mentioned an irrelevant material once, which adds to the conclusion that it's not there yet.
@0482 Tested your 10 variations on a set of free/paid models; only GPT-4 passes. Even Claude 3 Opus fails in many cases.
@0482 Can I get a vibe check on the tolerance of "reliably"? Like, what if a free model gets 8/10 questions or struggles on some of the oddly formatted ones?
@AnilJason By "reliably" I meant 100% correct. I can tolerate some confusion along the way as long as there is a final answer and it is correct.
@singer I think the implication here is that it’s a chatbot that is generally available to the public (i.e. doesn’t require you to download the weights and run the LLM yourself).
@singer that's the point of the general-purpose limitation, as opposed to "just trained it to pass this test".
@0482 If it was finetuned, it would still be general purpose, unless the finetuning was botched and ruined everything else it had learned. We generally still consider human students "general purpose" even after they've studied for a test. The resolution criteria say that you'll ask variations, presumably to avoid memorization.