Will we get hallucination rates down to human-expert levels by May 2024?

164

1.9kṀ37k

resolved May 2

Resolved

ALL

“I would bet you any sum of money you can get the hallucinations right down into the line of human-expert rate within months.”
— Reid Hoffman, CEO of LinkedIn, Co-Founder of Inflection.AI, September 2023
Gary Marcus: Reid, if you are listening, Gary is in for your bet, for $100,000.

I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.

Only issue is we don't have exact terms (and we don't know if Reid would actually accept for size, although he's definitely good for it if he did.)

If the bet is formalized and made, then this question will resolve to the outcome of the wager, or my judgment of who should have won the wager if the sides dispute who won.

If the bet is never formalized, this resolves to YES if there exists by May 1, 2024 an LLM, that is at least as otherwise capable as GPT-4, that hallucinates in typical conversations on questions where human experts exist at most (about as) often as human experts hallucinate when asked similar questions, to the point where you would treat the opinion as similarly reliable to an expert in terms of checking for hallucinations (fudge factor of 25% or so of the expert rate would be acceptable). Amount of effort and experimentation that goes into resolution will be proportional to trading activity.

Formalizations of terms and resolution are invited in the comments - if there's a good one, I will switch to that as the resolution if the bet is not formalized otherwise.

Standard other Zvi house rules apply (e.g. if this goes to >95% or <5% persistently and I am confident in the outcome I will resolve early, spirit of the question wins in case of ambiguity, etc.)

Market context

Technology

Science

Get

1,000

to start trading!

🏅 Top traders

#	Name	Total profit
1		Ṁ773
2		Ṁ141
3		Ṁ96
4		Ṁ80
5		Ṁ80

People are also trading

Before 2027, will a frontier AI model achieve an AA-Omniscience hallucination rate below 5%?

32% chance

Will LLM hallucinations be a fixed problem by the end of 2028?

42% chance

Before 2027, will OpenAI release a frontier model with a 5:1 or better abstention to hallucination ratio on SimpleQA?

51% chance

AI honesty #1: by 2027 will we have AI that doesn't hallucinate random nonsense?

45% chance

Will 25% of Americans be cognitively augmented humans by 2040?

19% chance

Will avg psychedelics/employee use at OpenAI in 2026 be less than 70% of its use rate in 2025?

59% chance

Will there be an attempt of a radical cognitive enhancement program with clinical trials on humans by January 1, 2030?

37% chance

Will 1% of Americans be cognitively augmented humans by 2033?

13% chance

Will 25% of EU citizens be cognitively augmented humans by 2040?

19% chance

Will scaling current methods be enough to eliminate LLM hallucination?

Sort by:

How to get a baseline for how often human experts hallucinate — after defining hallucination, that is. How expert? By what criteria? Delivering expertise in what context? Say it’s a golf question. Does expert have to be a winner of a major or just a scratch golfer? Golf Channel commentator only? Elite coach? If the question is about equipment can a Callaway engineer weigh in? What if it’s about the development of women’s golf in South Korea this century, is Rickie Fowler an expert on that? What if it’s about the best drill I Clubmaster Transparent can do to improve my putting? Pretty sure that doesn’t need any of the above.

@ClubmasterTransparent For the non-golfers, South Korean players are indeed a force in professional women’s golf, I didn’t hallucinate THAT.

@ZviMowshowitz how d'ya feel 'bout dat claude opus there, eh??

@beaver1 I do not think it qualifies.

sold Ṁ11 NO

Can get very low hallucination rate by always saying "I don't know". This isn't even cheating that much, as human experts are much less likely to know the answer to any question.

Also interested if this is comparing to a human expert with a library and similar, or a human expert in conversation.

bought Ṁ100 YES

4% is outrageously low, and a cruel under appreciation of the scale of interpretability effort to neutralize confabulation.

Does it have to be a single LLM? I’ve seen good results when using a collection of three LLMs where two are fact checking the main one.

"at most (about as) often as human experts hallucinate"

Do only literal hallucinations count, or does bullshitting (in the Harry Frankfurt sense) also count?

@zkmtx Hm, I think Zvi is an unusually conscientious market creator and put a lot of thought into how to operationalize this market well. If you don't like his criteria here, you can either suggest improvements, or set up your own market with better criteria!

Does the answer have to come purely from inference on the network's weights, or is the model allowed to post-process with RAG or something in that vein?

If only sheer model weights are allowed, then I think it's close to mathematically provable that there will always be hallucinations. The model is only so large and can't encode infinitely many facts without "lossy compression". If RAG-style verification is allowed, then it's just a matter of engineering and therefore only a matter of time (whether by May 2024 or not).

@apetresc If I am judging: You are allowed to use scaffolding, provided it does not impose unreasonable additional costs, and people would actually use the resulting overall system for their general LLM interactions or as parts of various systems.

Are hallucinations here more akin to a mistake (e.g. Napoleon was born in April instead of August) or to making up entire facts (e.g. Napoleon hung out with Karl Marx)?