Will LLM hallucinations be a fixed problem by the end of 2025?
22% chance

https://fortune.com/2023/08/01/can-ai-chatgpt-hallucinations-be-fixed-experts-doubt-altman-openai/

“This isn’t fixable,” said Emily Bender, a linguistics professor and director of the University of Washington’s Computational Linguistics Laboratory. “It’s inherent in the mismatch between the technology and the proposed use cases.”

How true will this end up being? At the end of 2025 I will evaluate whether the hallucination problem for LLMs has been fixed or still exists. If hallucinations have been solved, this market resolves YES. If the outstanding hallucination problem still exists, this market will resolve NO.

Related market:


DeepMind claims to be making progress on separating epistemic and aleatoric uncertainty, which would go a long way toward solving hallucinations.

https://arxiv.org/pdf/2406.02543
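As a toy illustration of the general idea (not the paper's actual method), you can probe how much a model's sampled answers disagree with each other: for a question with a single correct answer, high disagreement is a rough proxy for epistemic uncertainty. The `sample_answer` callable below is a placeholder for any temperature > 0 chat call.

```python
# Toy sketch, not the DeepMind paper's method: resample a question and
# measure how often the model's answers disagree with each other.
from collections import Counter
from typing import Callable, List

def answer_disagreement(sample_answer: Callable[[str], str],
                        question: str, n: int = 10) -> float:
    """Return 1 minus the frequency of the most common answer over n samples."""
    answers: List[str] = [sample_answer(question).strip().lower() for _ in range(n)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / n
```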

The ContextCite method, which looks at the counterfactual logits when you ablate a source, seems simple and potentially effective.

https://gradientscience.org/contextcite/
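The rough shape of the source-ablation idea, as I understand it (a leave-one-out simplification, not the actual ContextCite implementation, which fits a surrogate model over random ablations): score how much dropping each context source changes the log-probability the model assigns to the response it already generated. `logprob_of_response` is a stand-in for whatever scoring call your model exposes.

```python
# Leave-one-out sketch of source attribution (simplified, not ContextCite's code).
from typing import Callable, Dict, List

def source_attributions(logprob_of_response: Callable[[List[str], str, str], float],
                        sources: List[str], query: str, response: str) -> Dict[int, float]:
    full = logprob_of_response(sources, query, response)
    scores: Dict[int, float] = {}
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]   # drop source i from the context
        scores[i] = full - logprob_of_response(ablated, query, response)
    return scores  # large positive score => the response leaned on source i
```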

In general, "hallucinations" are caused by the fact that the text prior generalizes by being a stochastic model: in order to generalize, it has to model all of the possibilities implied by the data, not just the particular timeline in which it finds itself. However, it is possible to constrain sampling from the text prior with e.g. an embedding, as I do in my AdaVAE experiments:

https://www.greaterwrong.com/posts/4Hnso8NMAeeYs8Cta/revealing-intentionality-in-language-models-through-adavae

I would expect that by the end of 2025 someone will have noticed that you can do in-context decoding of an embedding and use this to fit retrieved memories to the local context in a way that minimizes hallucinations. Right now we don't actually pair our decoder-only transformer language models with an encoder, but in principle you could, and if there's a lot of pressure to solve hallucinations I don't see why you wouldn't eventually. Since I know the problem is solvable in principle, and the incentives to solve it are overwhelming, it would be fairly weird to me if nobody did.
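A minimal sketch of what pairing an encoder with a decoder-only model could look like (my own illustration, not the AdaVAE code): project an external embedding into a few soft-prompt vectors and prepend them to the decoder's input embeddings, so that sampling is constrained by the embedding.

```python
# Illustrative only: condition a decoder-only LM on an external embedding
# by projecting it into a short "soft prompt" prepended to the token embeddings.
import torch
import torch.nn as nn

class EmbeddingConditioner(nn.Module):
    def __init__(self, embed_dim: int, model_dim: int, n_prefix_tokens: int = 8):
        super().__init__()
        self.n_prefix_tokens = n_prefix_tokens
        self.proj = nn.Linear(embed_dim, model_dim * n_prefix_tokens)

    def forward(self, embedding: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, embed_dim); token_embeds: (batch, seq, model_dim)
        prefix = self.proj(embedding).view(embedding.size(0), self.n_prefix_tokens, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # feed the result to the decoder
```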

@SneakySly Only skimmed the paper, but I'm not very impressed by it as progress towards "solving LLM hallucination". It's not clear to me what about their method is new. They also focus on QA tasks with clearly defined right answers, which their method is relatively easy to apply to. Real-world use is often much more open-ended.

The nature of LLMs as next-token predictors might make this intractable IMO.

Could you clarify what you mean by "LLM" and the "hallucination problem"?

As some other people have asked, does an AI with an LLM bootstrapped to some other technique count? How much of the AI needs to be the LLM to meet the resolution criterion?

As for the hallucination problem, do you mean the specific problem of LLM hallucinations, compared to human hallucination for example, or do you mean any degree of hallucination, even if the AI is superhuman at not making stuff up? Does it count if one can make them hallucinate through prompt engineering, even if they are otherwise hallucinating less than humans?

@PierreLamotte For example, if we get a new ChatGPT that is LLM + other stuff and that solves it, that will count. If a new ChatGPT had no LLM aspect at all (a totally new architecture), then that would not count.

I clarified hallucinations in the linked market, which has the same criteria but a different year:

"
- What if very few everyday users can elicit hallucinations, but adversarial prompts created by experts still can?
Depends on the prompts. I think that if I personally can copy a prompt that follows the definitions already established and elicit a hallucination then the market resolves NO.
"

It's a tough bar!

@SneakySly I gave the following in the other thread:

True, if OpenAI claimed that GPT-X hallucinated 99% less often, that should resolve this market as YES. For the spirit of this question, that is essentially solving the problem. How do we feel about operationalizing this as: if someone posts a reputable article indicating a 95%+ reduction in hallucinations, we can say that hallucinations were not an intractable hurdle like these experts claimed, and the market can resolve YES. (Benchmarks would work as well if they get created.)

@SneakySly GPT-X compared to what? Compared to release GPT-4? Compared to GPT-3.5? Compared to GPT-(X-1)?

@SneakySly (only now seeing this) fwiw I had not interpreted this market as trying to operationalize hallucinations being an intractable hurdle (which I don't believe).

You can almost always find hallucinations if you're looking for them, even in bleeding-edge LLMs. One pretty reliable method is to take a well-known trick question and modify it to remove the trick. Even the top LLMs will often still respond as if you'd asked the original version of the trick question, rather than what you actually asked. Here's an example from GPT-4o:

Memory is flawed. Humans hallucinate all the time, just like GPT-4.

predicts YES

I'm surprised that this market isn't higher; it seems that there has been notable progress on the accuracy of LLM-generated content over the last ~1 year, and this seems likely to continue.

It's plausible that hallucinations are an inherent artifact of such stochastic generative systems, but "solving" hallucinations appears to be reducible to a problem of detecting when they occur, and supplementing with external sources or otherwise signaling the uncertainty. Perhaps I'm unclear on the resolution criteria, but as I understand it, the 98% accuracy with RAG described by OpenAI would have been sufficient to resolve YES (if it was actually achieved).

There also seems to be some tradeoff between an LLM being able to provide a correct answer and not providing incorrect answers, as you can always raise the bar of certainty required for a fact to be stated. I think a system that tells the right answer 80% of the time, is wrong 0.1% of the time, and otherwise explains that it does not know is more useful than one that answers correctly 99% of the time and is wrong 1% of the time. If such a ratio as 80%/19.9%/0.1% is plausible by fine-tuning current systems to hedge more frequently, would this pass for hallucinations being solved? If the rate of falsehoods is the primary factor, then I feel it should, as it would be greater than human level.
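To make the tradeoff concrete, here is a sketch of the hedging knob, assuming you have some per-answer confidence score (however it is obtained); raising the threshold moves answers from the "wrong" bucket into the "declined to answer" bucket.

```python
# Sketch of the hedging tradeoff: answer only above a confidence threshold.
from typing import Callable, Optional, Tuple

def answer_or_abstain(generate: Callable[[str], Tuple[str, float]],
                      question: str, threshold: float = 0.9) -> Optional[str]:
    answer, confidence = generate(question)   # confidence in [0, 1], model-dependent
    if confidence >= threshold:
        return answer
    return None  # the caller treats None as "I don't know"
```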

It's also worth considering that hallucinations are one of the primary (if not THE primary) roadblocks preventing more widespread use of current LLMs. It seems likely that huge resources will be put into improving their factual accuracy over the next two years.

Is there something in particular that I'm missing?

bought Ṁ100 NO from 33% to 31%

@CollinGray You are describing a machine that can tell whether an arbitrary statement is true or not. We can't do that as humans, so why do you think it's going to be easy to do with an LLM?

Otherwise I agree with you: I'd rather have something that's only wrong 0.1% of the time, even if I had to sacrifice some answers that were correct but not confident. I disagree with your numbers, though; 80% of the time is not good enough. That would be too low a threshold and would result in the AI only being able to answer trivially true things.

predicts YES

@Odoacre You're right that 80% of questions being answered is too low to be very useful, but my larger point was that "solving hallucinations" is a very fuzzy target: you can always reduce fake answers at the cost of correct answers by rewarding hedging, so I'm unclear about what a system that fulfills this market would look like.

As for your first point, I agree that determining whether an arbitrary statement is true or not is very difficult, but LLMs have more information to work with, i.e. a probability distribution over all possible answers. The assumption here is that, unlike hallucinated facts, memorized facts are far more self-consistent. For example, "Washington" will dominate the next-token probability distribution(s) for "The 1st president was", but "The 200th president was" will result in a distribution with multiple top possibilities. For a better explanation than I can give, check out this paper.
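A toy version of that intuition, with `next_token_probs` standing in for whatever top-k probability API you have: memorized facts should concentrate next-token probability mass (low entropy), while confabulated ones should spread it out (high entropy).

```python
# Illustration of the self-consistency intuition via next-token entropy.
import math
from typing import Callable, Dict

def next_token_entropy(next_token_probs: Callable[[str], Dict[str, float]],
                       prompt: str) -> float:
    probs = next_token_probs(prompt)          # maps candidate token -> probability
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

# Expectation: entropy("The 1st president was") is low ("Washington" dominates),
# while entropy("The 200th president was") is high (no single name dominates).
```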

This depends largely on how you define an "LLM".

Could an LLM plus a bunch of non-LLM infrastructure be forced to forgo all hallucinations?
Possibly.

Could a different architecture from current-day LLMs, one that actually performs cognition-like operations on its internal knowledge but superficially resembles LLMs externally, forgo all hallucinations?
Possibly.

Could a pure LLM forgo hallucination?
Not with any training dataset that presently exists.

There are many degrees of "fixing" hallucinations. If we're talking about always stating things with 100% accuracy or replying with "unknown", then LLMs will be restricted to producing mathematical formulas alone, as anything else could potentially be attacked as a "hallucination" for not being precise enough.

If we're talking about being less prone to hallucinations than a human with access to Google, then it's a much more realistic proposition, and I would be YES on that.

predicts YES

Relevant - because if they did, then it's more or less solved.

predicts NO

@firstuserhere Eh, it's highly doubtful that getting a high benchmark in that kind of controlled environment is equivalent to solving the problem, imo. Like, I would still be very surprised if it doesn't still hallucinate about things right at the edge of its knowledge context or when it comes to more niche domain-specific questions.

predicts YES

@Sphinxfire That's why I only bought it up by <10%.

@firstuserhere Tbh they aren't related. Also already mentioned why that market should just be N/Aed.

Gary Marcus offered to bet $100,000 on this! Seems like an important question, and personally I'm surprised to see this as high as 23% when Marcus is offering to bet so much against it! I'm going to add in a 2000 mana subsidy here as part of the Public Interest Subsidy Program.

bought Ṁ200 NO from 30% to 28%

Oh, and I see someone made an even more specific market with a 3-month timeline in response to Hoffman! Arb away!

I have created a new related market set to the end of 2028.

Might depend on