Will an LLM be able to solve confusing but elementary geometric reasoning problems in 2024? (strict LLM version)
Resolved NO on Jan 2

This is a variant of the following market:

https://manifold.markets/dreev/will-an-llm-be-able-to-solve-confus

In this version, the problem has to be solved purely by the LLM itself.

Open question: Does GPT-o1 count as "strictly an LLM"? Seems super ambiguous to me. I've sold my shares in this market so I can just make a judgment call. The default is yes, it counts, but if I hear a compelling counterargument in the comments, I'll make an update.

sold Ṁ768 NO

I've just sold my position in this market, so hopefully I can now make the judgment call: assuming ChatGPT o1 can solve the problem, does that count as it being solved "purely by the LLM itself"? Can I hear people's arguments for and against?

Sanity check: if GPT-o1 were to pull this off in time, that still counts as a strict LLM, right?

@dreev I think so

bought Ṁ250 NO

@dreev This will become increasingly challenging as more and more models are integrated into a single system, which is partly why I don't bet much on "Will LLMs do [task] by [future year]?" markets. But yes, I think it's reasonable to call GPT-o1 an LLM for the rest of 2024.

@Jacy Thank you, that makes a ton of sense. I shall avoid trying to single out LLMs in the future and hope that this one won't turn out too painful to adjudicate over the remaining 3 months of 2024. If anyone has any counterpoints about GPT-o1, chime in! (Not that it matters yet, with GPT-o1 still failing our flagship geometric reasoning problem, but it does seem to be getting closer... 😬)

bought Ṁ100 YES

@dreev Bet this one up, since if o1 counts as an LLM, I think this also resolves YES?

@JohnCarpenter Oh, yeah, great question. Does o1 count as purely an LLM? [PS: ha, when I originally replied here I didn't notice that this was in fact exactly the question at the top of this thread. I'm still highly uncertain, but we'll go with yes unless anyone has a good counterargument.]

@dreev /shrug idk what the word pure means. It's one model that queries itself a lot of times in a row

@JohnCarpenter Yeah this is very nonobvious to me. There's presumably additional code for constructing the chain-of-reasoning prompts.
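
To make concrete what "additional code around a single model" could look like, here's a purely hypothetical sketch of a self-querying loop. It is not OpenAI's actual o1 implementation; `solve_with_self_querying`, the `query_model` callable, and the stop marker are invented for illustration of the kind of scaffolding being debated.

```python
from typing import Callable

def solve_with_self_querying(
    problem: str,
    query_model: Callable[[str], str],  # one call to the underlying LLM (stubbed here)
    max_rounds: int = 5,
) -> str:
    """Repeatedly feed the same model its own prior reasoning.

    The loop is the kind of "additional code" under discussion: the model
    itself is a single LLM, but orchestration logic decides what to prompt
    it with next and when to stop.
    """
    transcript = f"Problem: {problem}\n"
    for _ in range(max_rounds):
        step = query_model(transcript + "\nContinue reasoning step by step.")
        transcript += step + "\n"
        if "FINAL ANSWER:" in step:  # invented stop marker for this sketch
            break
    return transcript
```

Whether a wrapper like this disqualifies the system from being "strictly an LLM" is exactly the judgment call at issue in this thread.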

@dreev I'd say o1 counts unless it uses code execution to help it solve the problem.

@MugaSofer That's sounding reasonable to me. If anyone has counterarguments, now's the time to make them. At this point the default is that this is going to resolve the same way the parent market does.
