— Reid Hoffman, Co-Founder of LinkedIn and Inflection AI, September 2023
Gary Marcus: Reid, if you are listening, Gary is in for your bet, for $100,000.
I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.
Only issue is we don't have exact terms (and we don't know if Reid would actually accept for size, although he's definitely good for it if he did).
If the bet is formalized and made, then this question will resolve to the outcome of the wager, or my judgment of who should have won the wager if the sides dispute who won.
If the bet is never formalized, this resolves to YES if, by May 1, 2024, there exists an LLM that is otherwise at least as capable as GPT-4 and that, in typical conversations on questions where human experts exist, hallucinates at most (about) as often as human experts do when asked similar questions — to the point where you would treat its opinion as similarly reliable to an expert's in terms of checking for hallucinations (a fudge factor of 25% or so of the expert rate would be acceptable). The amount of effort and experimentation that goes into resolution will be proportional to trading activity.
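For concreteness, here is a minimal sketch of the numeric criterion as one might compute it; the multiplicative reading of the 25% fudge factor and the example rates are assumptions for illustration, not formal terms of the market:

```python
# Minimal sketch of the numeric criterion above. The multiplicative reading
# of the 25% fudge factor and the example rates are assumptions for
# illustration, not formal terms.

def resolves_yes(llm_rate: float, expert_rate: float, fudge: float = 0.25) -> bool:
    """YES if the LLM hallucinates at most ~(1 + fudge) times as often as
    human experts do on similar questions."""
    return llm_rate <= expert_rate * (1 + fudge)

# Example: experts hallucinate on 2% of such questions, the LLM on 2.4%.
print(resolves_yes(0.024, 0.02))  # True, since 2.4% <= 2.5%
```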
Formalizations of terms and resolution are invited in the comments - if there's a good one, I will switch to that as the resolution if the bet is not formalized otherwise.
Standard other Zvi house rules apply (e.g. if this goes to >95% or <5% persistently and I am confident in the outcome I will resolve early, spirit of the question wins in case of ambiguity, etc.)
🏅 Top traders
| # | Name | Total profit |
|---|------|---|
| 1 | | Ṁ773 |
| 2 | | Ṁ141 |
| 3 | | Ṁ96 |
| 4 | | Ṁ80 |
| 5 | | Ṁ80 |
How to get a baseline for how often human experts hallucinate — after defining hallucination, that is. How expert? By what criteria? Delivering expertise in what context? Say it’s a golf question. Does the expert have to be a winner of a major, or just a scratch golfer? Only a Golf Channel commentator? An elite coach? If the question is about equipment, can a Callaway engineer weigh in? What if it’s about the development of women’s golf in South Korea this century? Is Rickie Fowler an expert on that? What if it’s about the best drill I, Clubmaster Transparent, can do to improve my putting? Pretty sure that doesn’t need any of the above.
@ClubmasterTransparent For the non-golfers, South Korean players are indeed a force in professional women’s golf, I didn’t hallucinate THAT.
You can get a very low hallucination rate by always saying "I don't know". This isn't even cheating that much, as human experts are much less likely to know the answer to any given question.
Also interested in whether this is comparing to a human expert with a library and similar resources, or to a human expert in conversation.
4% is outrageously low, and a cruel underappreciation of the scale of interpretability effort needed to neutralize confabulation.
Does it have to be a single LLM? I’ve seen good results when using a collection of three LLMs where two are fact checking the main one.
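A minimal sketch of that kind of setup, assuming a hypothetical `call_llm(model, prompt)` helper (not any particular API): one model drafts the answer and two others independently check it, with the system abstaining unless both checkers pass the draft.

```python
# Sketch of a three-model setup: one answerer, two fact-checkers.
# `call_llm` is a hypothetical stand-in; wire it to a real LLM API to use this.

def call_llm(model: str, prompt: str) -> str:
    return "PASS"  # placeholder so the sketch runs end to end

def answer_with_checking(question: str) -> str:
    draft = call_llm("answerer", question)
    check_prompt = (
        f"Question: {question}\nAnswer: {draft}\n"
        "Reply PASS if every factual claim in the answer is correct, otherwise FAIL."
    )
    verdicts = [call_llm(m, check_prompt) for m in ("checker-1", "checker-2")]
    # Only return the draft if both checkers pass it; otherwise abstain.
    if all(v.strip().upper().startswith("PASS") for v in verdicts):
        return draft
    return "I don't know."

print(answer_with_checking("When was Napoleon born?"))
```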
"at most (about as) often as human experts hallucinate"
Do only literal hallucinations count, or does bullshitting (in the Harry Frankfurt sense) also count?
Does the answer have to come purely from inference on the network's weights, or is the model allowed to post-process with RAG or something in that vein?
If only sheer model weights are allowed, then I think it's close to mathematically provable that there will always be hallucinations. The model is only so large and can't encode infinitely many facts without "lossy compression". If RAG-style verification is allowed, then it's just a matter of engineering and therefore only a matter of time (whether by May 2024 or not).
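A rough sketch of what RAG-style post-processing could look like, with `call_llm` and `retrieve` as hypothetical stand-ins (not any real API): draft an answer, retrieve passages from an external source, and abstain unless the passages support the draft.

```python
# Rough sketch of RAG-style post-processing: draft, retrieve, verify, abstain.
# `call_llm` and `retrieve` are hypothetical stand-ins, not any real API.

from typing import List

def call_llm(prompt: str) -> str:
    return "(placeholder output)"  # replace with a real LLM call

def retrieve(query: str, k: int = 3) -> List[str]:
    return []  # replace with a real search over a document index

def verified_answer(question: str) -> str:
    draft = call_llm(f"Answer concisely: {question}")
    passages = retrieve(question)
    verdict = call_llm(
        "Passages:\n" + "\n".join(passages) +
        f"\n\nAnswer: {draft}\n"
        "Reply SUPPORTED only if the passages back every claim in the answer."
    )
    # Abstain rather than risk a hallucination when support is missing.
    return draft if verdict.strip().upper().startswith("SUPPORTED") else "I don't know."

print(verified_answer("When was Napoleon born?"))
```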
Are hallucinations here more akin to a mistake (e.g. Napoleon was born in April instead of August) or to making up entire facts (e.g. Napoleon hung out with Karl Marx)?