Will we get hallucination rates down to human-expert levels by May 2024?
164 traders · Ṁ37k volume · Resolved NO on May 2

"I would bet you any sum of money you can get the hallucinations right down into the line of human-expert rate within months."

— Reid Hoffman, Co-Founder of LinkedIn, Co-Founder of Inflection AI, September 2023

Gary Marcus: Reid, if you are listening, Gary is in for your bet, for $100,000.

I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.

The only issue is that we don't have exact terms (and we don't know if Reid would actually accept for size, although he's definitely good for it if he did).

If the bet is formalized and made, then this question will resolve to the outcome of the wager, or my judgment of who should have won the wager if the sides dispute who won.

If the bet is never formalized, this resolves to YES if, by May 1, 2024, there exists an LLM that is otherwise at least as capable as GPT-4 and that hallucinates, in typical conversations on questions where human experts exist, at most (about) as often as human experts hallucinate when asked similar questions, to the point where you would treat its opinion as similarly reliable to an expert's in terms of checking for hallucinations (a fudge factor of 25% or so of the expert rate would be acceptable). The amount of effort and experimentation that goes into resolution will be proportional to trading activity.
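For illustration only (the 2% expert rate below is a made-up number, not part of the criteria), the fudge factor works out like this:

```python
# Illustrative arithmetic only; the 2% expert rate is an assumed example value.
expert_rate = 0.02                     # hypothetical human-expert hallucination rate
fudge = 0.25                           # "fudge factor of 25% or so of the expert rate"
threshold = expert_rate * (1 + fudge)  # the LLM would need to be at or below ~2.5%
print(threshold)                       # 0.025
```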

Formalizations of terms and resolution are invited in the comments - if there's a good one, I will switch to that as the resolution if the bet is not formalized otherwise.

Standard other Zvi house rules apply (e.g. if this goes to >95% or <5% persistently and I am confident in the outcome I will resolve early, spirit of the question wins in case of ambiguity, etc.)


How do you get a baseline for how often human experts hallucinate (after defining hallucination, that is)? How expert? By what criteria? Delivering expertise in what context? Say it's a golf question. Does the expert have to be a winner of a major, or just a scratch golfer? Only a Golf Channel commentator? An elite coach? If the question is about equipment, can a Callaway engineer weigh in? What if it's about the development of women's golf in South Korea this century; is Rickie Fowler an expert on that? What if it's about the best drill I, ClubmasterTransparent, can do to improve my putting? Pretty sure that doesn't need any of the above.

@ClubmasterTransparent For the non-golfers: South Korean players are indeed a force in professional women's golf. I didn't hallucinate THAT.

@ZviMowshowitz how d'ya feel 'bout dat claude opus there, eh??

@beaver1 I do not think it qualifies.


You can get a very low hallucination rate by always saying "I don't know." This isn't even cheating that much, as human experts are much less likely to know the answer to any given question.

I'm also interested in whether this is comparing against a human expert with a library and similar resources, or a human expert in conversation.


4% is outrageously low, and a cruel underappreciation of the scale of the interpretability effort to neutralize confabulation.


Does it have to be a single LLM? I’ve seen good results when using a collection of three LLMs where two are fact checking the main one.
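For concreteness, that kind of setup might look roughly like the sketch below. Here call_model and the model names are placeholders standing in for whatever API such a system actually uses, not any specific implementation:

```python
# Rough sketch of the "two fact-checkers" setup described above.
# call_model() and the model names are placeholders, not any particular API.

def call_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text reply."""
    raise NotImplementedError

def answer_with_checks(question: str) -> str:
    draft = call_model("main-llm", question)

    verdicts = []
    for checker in ("checker-llm-a", "checker-llm-b"):
        verdict = call_model(
            checker,
            f"Question: {question}\n"
            f"Proposed answer: {draft}\n"
            "Does the answer contain claims you cannot verify? "
            "Reply SUPPORTED or UNSUPPORTED.",
        )
        verdicts.append("UNSUPPORTED" not in verdict.upper())

    # Only return the draft if both checkers sign off; otherwise abstain.
    return draft if all(verdicts) else "I'm not confident enough to answer that."
```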


"at most (about as) often as human experts hallucinate"

Do only literal hallucinations count, or does bullshitting (in the Harry Frankfurt sense) also count?



@zkmtx Hm, I think Zvi is an unusually conscientious market creator and put a lot of thought into how to operationalize this market well. If you don't like his criteria here, you can either suggest improvements, or set up your own market with better criteria!

Does the answer have to come purely from inference on the network's weights, or is the model allowed to post-process with RAG or something in that vein?

If only sheer model weights are allowed, then I think it's close to mathematically provable that there will always be hallucinations. The model is only so large and can't encode infinitely many facts without "lossy compression". If RAG-style verification is allowed, then it's just a matter of engineering and therefore only a matter of time (whether by May 2024 or not).

@apetresc If I am judging: You are allowed to use scaffolding, provided it does not impose unreasonable additional costs, and people would actually use the resulting overall system for their general LLM interactions or as parts of various systems.
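For concreteness, scaffolding of the RAG-style sort @apetresc describes might look roughly like this sketch, where retrieve and call_model are placeholders rather than any particular library's API:

```python
# Rough sketch of RAG-style post-processing as scaffolding around a base model.
# retrieve() and call_model() are placeholders, not a specific library's API.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k passages from some document store."""
    raise NotImplementedError

def call_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text reply."""
    raise NotImplementedError

def verified_answer(question: str) -> str:
    draft = call_model("base-llm", question)
    evidence = "\n".join(retrieve(question))

    check = call_model(
        "base-llm",
        f"Evidence:\n{evidence}\n\n"
        f"Answer to check: {draft}\n"
        "Is every factual claim in the answer supported by the evidence? "
        "Reply YES or NO.",
    )
    if check.strip().upper().startswith("YES"):
        return draft

    # Otherwise, regenerate the answer using only the retrieved evidence.
    return call_model(
        "base-llm",
        "Using only the evidence below, answer the question, and say "
        "'I don't know' if the evidence is insufficient.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}",
    )
```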

Are hallucinations here more akin to a mistake (e.g. Napoleon was born in April instead of August) or to making up entire facts (e.g. Napoleon hung out with Karl Marx)?

@SemioticRivalry Those would both count, in my book. Making up that which is not.