— Reid Hoffman, Co-Founder of LinkedIn, Co-Founder of Inflection AI, September 2023
Gary Marcus: Reid, if you are listening, Gary is in for your bet, for $100,000.
I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.
Only issue is we don't have exact terms (and we don't know whether Reid would actually accept for size, although he's definitely good for it if he did).
If the bet is formalized and made, then this question will resolve to the outcome of the wager, or my judgment of who should have won the wager if the sides dispute who won.
If the bet is never formalized, this resolves to YES if, by May 1, 2024, there exists an LLM that is otherwise at least as capable as GPT-4 and that hallucinates, in typical conversations on questions where human experts exist, at most about as often as human experts hallucinate when asked similar questions, to the point where you would treat its opinion as similarly reliable to an expert's in terms of checking for hallucinations (a fudge factor of 25% or so of the expert rate would be acceptable). The amount of effort and experimentation that goes into resolution will be proportional to trading activity.
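For concreteness, here is one way to read that numeric criterion as a check, a minimal sketch in which the 25% fudge factor comes from the text above, while the function name and the example rates are purely illustrative:

```python
def resolves_yes(model_rate: float, expert_rate: float, fudge: float = 0.25) -> bool:
    """One reading of the criterion: the model's hallucination rate on
    expert-answerable questions may exceed the expert rate by at most ~25%."""
    return model_rate <= expert_rate * (1 + fudge)

# Illustrative numbers: suppose experts hallucinate on 4% of such questions.
print(resolves_yes(model_rate=0.048, expert_rate=0.04))  # True: within the fudge factor
print(resolves_yes(model_rate=0.080, expert_rate=0.04))  # False: double the expert rate
```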
Formalizations of terms and resolution are invited in the comments; if there's a good one, I will switch to it as the resolution criterion if the bet is not otherwise formalized.
Other standard Zvi house rules apply (e.g. if this goes persistently above 95% or below 5% and I am confident in the outcome, I will resolve early; the spirit of the question wins in case of ambiguity; etc.).
How do you get a baseline for how often human experts hallucinate (after defining hallucination, that is)? How expert? By what criteria? Delivering expertise in what context? Say it's a golf question. Does the expert have to be a winner of a major, or just a scratch golfer? Only a Golf Channel commentator? An elite coach? If the question is about equipment, can a Callaway engineer weigh in? What if it's about the development of women's golf in South Korea this century? Is Rickie Fowler an expert on that? What if it's about the best drill I, Clubmaster Transparent, can do to improve my putting? Pretty sure that doesn't need any of the above.
@ClubmasterTransparent For the non-golfers: South Korean players are indeed a force in professional women's golf; I didn't hallucinate THAT.
You can get a very low hallucination rate by always saying "I don't know." This isn't even cheating that much, as a human expert is much less likely than an LLM to know the answer to any given question.
I'm also interested in whether this is comparing to a human expert with a library and similar resources, or to a human expert in conversation.
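As a toy illustration of that loophole (all numbers here are made up): a policy that never commits to an answer trivially achieves a 0% hallucination rate, which suggests any resolution procedure has to look at coverage alongside the hallucination rate.

```python
# The trivial always-abstain policy: zero hallucinations, but also zero coverage.
answers = ["I don't know"] * 100

attempted = [a for a in answers if a != "I don't know"]
hallucinated = 0  # an abstention cannot hallucinate

coverage = len(attempted) / len(answers)
rate = hallucinated / len(attempted) if attempted else 0.0
print(f"coverage={coverage:.0%}, hallucination rate={rate:.0%}")  # coverage=0%, hallucination rate=0%
```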
@zkmtx Hm, I think Zvi is an unusually conscientious market creator and put a lot of thought into how to operationalize this market well. If you don't like his criteria here, you can either suggest improvements, or set up your own market with better criteria!
Does the answer have to come purely from inference on the network's weights, or is the model allowed to post-process with RAG or something in that vein?
If only sheer model weights are allowed, then I think it's close to mathematically provable that there will always be hallucinations: the model is only so large and can't encode arbitrarily many facts without "lossy compression." If RAG-style verification is allowed, then it's just a matter of engineering, and therefore only a matter of time (whether by May 2024 or not).
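For readers unfamiliar with the idea, here is a minimal sketch of the retrieve-then-verify post-processing being described; `retrieve`, `llm`, and `answer_with_verification` are hypothetical stand-ins for a real search index and model API, not any particular library:

```python
from typing import List

def retrieve(query: str) -> List[str]:
    # Hypothetical stub standing in for a search index or document store.
    return ["Example passage relevant to the query."]

def llm(prompt: str) -> str:
    # Hypothetical stub standing in for a call to the underlying model.
    return "SUPPORTED"

def answer_with_verification(question: str) -> str:
    """Draft an answer, then check it against retrieved evidence before returning it."""
    draft = llm(f"Answer concisely: {question}")
    evidence = "\n".join(retrieve(question))
    verdict = llm(
        "Does the evidence support the draft answer? Reply SUPPORTED or UNSUPPORTED.\n"
        f"Evidence:\n{evidence}\nDraft answer: {draft}"
    )
    # Abstain rather than risk a hallucination when the evidence does not back the draft.
    return draft if verdict.strip() == "SUPPORTED" else "I don't know."

print(answer_with_verification("Who co-founded LinkedIn?"))
```

Whether such scaffolding counts is exactly the question the next reply addresses.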
@apetresc If I am judging: you are allowed to use scaffolding, provided it does not impose unreasonable additional costs and provided people would actually use the resulting overall system for their general LLM interactions or as a component of various systems.