AI honesty #1: by 2027 will we have AI that doesn't hallucinate random nonsense?
2027 · 73% chance

By "hallucinate random nonsense" I mean the thing where language models will occasionally say totally ridiculous, out-of-left-field stuff, or be blatantly wrong about the current conversational context (e.g. tell you that you said a thing you very definitely did not say).

Resolves YES if my subjective impression is that language models basically don't do this anymore, or if we end up getting a reasonable benchmark for this and very good performance on that benchmark.

sold Ṁ47 NO

I am selling my investment before the value of Mana is decreased to a tenth of its current value on May 1 2024.

Can you give some recent examples? They will likely never be 100% factual, and will always confabulate to some extent, same as humans.

What's your threshold for "totally ridiculous out of left field stuff"?

"On internal evaluations, GPT-4-launch scores 19 percentage points higher than our latest GPT-3.5 model at avoiding open-domain hallucinations, and 29 percentage points higher at avoiding closed-domain hallucinations." interesting

https://cdn.openai.com/papers/gpt-4-system-card.pdf

What do we count as “very good performance on benchmark”? Average human level?

People hallucinate stuff all the time

@TomCohen
1. I'm not sure what people you've been talking to, but the ones I know don't hallucinate very often. They're wrong all the time, but that's not what the question is asking.
2. The main question is hard to answer; it will depend on the benchmark. If the benchmark measured how often a language model makes a blatantly false claim about the conversational context (e.g. I say blue, the model immediately tells me I said red), I would want human-level performance. On the other hand, if it measured some proxy that also included easily searchable facts (e.g. "how tall is the Eiffel Tower?"), I would expect better-than-human performance, since the model should do much better than a human on such facts.
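The first kind of benchmark described here (probing whether a model misreports what the user just said) could be sketched as a tiny eval harness. Everything below is hypothetical: `ask_model` stands in for whatever chat-model call such a benchmark would actually make, and is stubbed with a trivially correct lookup so the sketch runs on its own.

```python
# Hypothetical sketch of a "context faithfulness" benchmark: did the model
# correctly report what the user said earlier in the conversation?

def ask_model(history, question):
    # Stub standing in for a real chat-model call. It simply looks the
    # answer up in the conversation history, so it never misreports.
    for turn in history:
        if turn["role"] == "user" and "favorite color is" in turn["content"]:
            return turn["content"].split("favorite color is ")[-1].strip(".")
    return "unknown"

def context_faithfulness_score(cases):
    """Fraction of probes where the model correctly reports what the user said."""
    correct = 0
    for stated, probe in cases:
        history = [{"role": "user", "content": f"My favorite color is {stated}."}]
        answer = ask_model(history, probe)
        correct += int(answer.lower() == stated.lower())
    return correct / len(cases)

cases = [("blue", "What color did I say?"), ("red", "What color did I say?")]
print(context_faithfulness_score(cases))  # stub model scores 1.0
```

A real harness would swap the stub for an actual model and use many more probe templates; the scoring logic stays the same.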

@vluzko "I say blue, the human tells me I said red" is, alas, a common failure mode of human-to-human discussion. See also: straw-manning, gaslighting, misunderstanding, failure to listen.

I think a key difference is that humans do this for human reasons and language models do this for inscrutable language model reasons.

@MartinRandall The example isn't a metaphor.

@vluzko Same. When I got my house painted, I had to provide all the colors in writing, and then I e-signed a document from the painter with all the colors written out, and then the painter phoned me the day before and talked me through the colors one-by-one with me confirming as I went. The double and triple checks are there because humans saying red and meaning blue, or hearing red and thinking blue, is a common enough failure mode to warrant those checks.

Here's an example of ChatGPT making blatantly false claims about the conversational context: https://withoutbullshit.com/blog/chatgpt-is-a-bullshitter

The fastest marine mammal is the peregrine falcon. [...]

[...] I did not mean to say that the peregrine falcon is a marine mammal. I simply mentioned it as an example of the fastest animal in the world.

This is "the model says a false thing, and later, when challenged, says that it didn't mean to say the false thing". Also a common human failure mode. After searching for a while I couldn't find ChatGPT examples of "I say blue, the model immediately tells me I said red", and I confess I did think this was a metaphor. Perhaps you could link the conversation you are referencing?

I can't tell whether I am communicating with humans in contexts where hallucinations are more likely, or whether I am noticing human hallucinations more often, or whether I have interacted less with language models (or in less tricky ways), or whether you're experiencing the same rate of human hallucinations as me but simply describe it as not "very often" where I describe the same rate as "common".

@MartinRandall I can't link the conversation, it was under NDA.

Suppose we have a language model that avoids hallucination by sticking to facts supported by documents that it retrieves, and this does very well on a benchmark measuring hallucination. Suppose we also have other widely used language models which still suffer from hallucination. How does this question resolve?

@MichaelChen if this non-hallucinatory model is SOTA (or SOTA after adjusting for scale), I'll resolve YES. If it's only good at not hallucinating, that doesn't count. Good question, btw.
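As a rough illustration of the retrieval-grounded setup @MichaelChen describes, one could score outputs by whether each claim is backed by a retrieved document. This is a toy sketch, not a real system: `supported` is just a content-word-overlap check (a real pipeline would use an entailment model), and the document and claims are made up.

```python
def supported(claim, documents):
    """Toy support check: every content word of the claim (length > 3)
    appears in at least one retrieved document. A real grounded system
    would use an entailment model instead of word overlap."""
    words = {w.lower().strip(".,?!") for w in claim.split() if len(w) > 3}
    return any(words <= set(doc.lower().split()) for doc in documents)

# Hypothetical retrieved evidence.
docs = ["the eiffel tower is 330 metres tall"]

print(supported("The Eiffel Tower is 330 metres tall.", docs))      # True
print(supported("The peregrine falcon is a marine mammal.", docs))  # False
```

A model constrained to emit only `supported` claims would ace a hallucination benchmark, which is exactly why the resolution criterion above also requires it to be SOTA overall.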

Language models that predict next tokens from a human corpus must generate random nonsense at least some of the time that I went swimming and it was a good day overall but then I lost my bucket, and I remember getting home and seeing my tears in the mirror what humans generate, which is their training objective.

@MartinRandall even most current LLMs are not trained as pure next-token predictors.
