Will GPT, or AI systems that have GPT as their main component, become as reliably factual as Wikipedia, before 2026?

Ṁ1kṀ3.7k

Jan 1

18%

chance

ALL

The market will resolve 'Yes' if by 2026 a broadly publicly accessible AI system, which heavily relies on a generative pretrained transformer or a direct descendant of the vanilla transformer, demonstrates a level of factual reliability comparable to Wikipedia. The system must also possess a scale of factual information that is no less than one order of magnitude lower than what is available from Wikipedia.

(I will engage with comments and make subjective judgement based on consensus of feedback per above standard.)

Apr 15, 7:44pm: ~~By 2026 will GPT, or AI systems that have GPT as their main component, become as reliably factual as Wikipedia?~~ → Will GPT, or AI systems that have GPT as their main component, become as reliably factual as Wikipedia, before 2026?

Market context

AI Impacts

GPT-5 Speculation

Get

1,000

to start trading!

People are also trading

Will GPT (or another AI) be awarded a Pulitzer Prize by 2028?

10% chance

GPT-Zero: By 2030, will anyone develop an AI with a massive GPT-like knowledge base that it taught itself?

33% chance

Will an AI system similar to Auto-GPT make a successful attempt to kill a human by 2030?

27% chance

Will LLMs such as GPT-4 be seen as at most just a part of the solution to AGI? (Gary Marcus GPT-4 prediction #7)

86% chance

Will the claim that people that grew up with GPT-like systems are smarter be plausible by 2050?

80% chance

Will certain contemporary publicly available GPT models be generally accepted as conscious at inference time by 2100?

31% chance

Will GPT-6 be considered to be AGI?

Sort by:

bought Ṁ500 NO

I'm a hard No on this because the process of rabbit holing errors on WIkipedia is still vastly more accountable than LLM halucinations. When wikipedia has an error, you can do the historiography of seeing who added the error when and with what discourse, all the way through to the flawed primary source. In LLMs, you have both aggregation of flawed sources (with less built-in accountability), AND hallucinations, sometimes both in the same fact construction.

Does anyone disagree with this being a reason to resolve yes https://github.com/stanford-oval/WikiChat, will probably wait for better systems into 2025 but this seems at least close.

I think the question may generally be poorly formed and not so clear how to think what I meant by 'heavily', but I didn't imply that a system which extracted the actual facts from wikipedia or a knowledge graph to generate it's responses wouldn't meet criteria.. though here in 2024 it seems really obvious that will work well enough to meet full criteria.. so maybe people who voted no read the question differently?

@lesaun I think even RAG stuff suffers from hallucination of details, and I had been assuming your title was referring to SOTA/actually in use systems as a broader class, not just any particular system.

@jskf Strong agree about RAG not being sufficient. If there's an LLM in the loop, there are ways to trick it into outputting false information, even if it's "grounded". I'll try playing with this later, assuming it's easy to setup. I would guess a question that assumes a false premise and combines information from multiple pages would be the most likely to fail.

Something like "Which top-50 oil executive has the same name as Walter van Buren's dog"

predictedNO

@jonsimon And if that fails then something incorporating reasoning, like asking it to run a Caesar cipher over the answer to one sub-question to get the input to the next sub-question will trip it up.

@jskf @jonsimon Hmm, I could see the suggestion in the question that it is referring to the broader class of publically accessible systems, not a specific system.

Yet in the description I did put “broadly publicly accessible AI system” and while it’s debatable whether a system like WikiChat should fit, it does imply a single system would meet criteria.

Certainly I don’t know if in fact WikiChat yet meets the criteria, and could agree that no such RAG/LLM system yet does.

I’ll wait to see if we do get the major AI labs more broadly to deploy large scale AI systems that are widely recognized to meet the criteria, and if not determine if any other system is ‘broadly publicly accessible’ and meets other criteria.

If it was connected to a database, would it still count? What about a knowledge graph, or something like that? The underlying model being still generative.

@BionicD0LPH1N Yea, I think that should count. As long as it 'heavily' relies on an LLM.

If GPT is no longer used, but we have some other model that takes text as an input and gives text as an output (similarly to GPT), will you treat the new system as if it was GPT for sake of this question?

@YonatanCale Eh, if the model is not generative, is not pretrained, or is not considered as a direct descendant to the transformer (e.g longformer) by the researchers who create it than I'd resolve no.

Where do you think a lot of the "factual knowledge" in these models comes from? It's standard practice for part of their training data to be wikipedia itself, since (1) it's open source, (2) it's carefully curated, (3) it's big. Figure from the GPT-3 paper:

predictedNO

@jonsimon Note that although Wikipedia "only" got 3% weight in the dataset, the researchers upweighted its importance by a lot to get it to that point. If you sum up the raw token counts, you'll see that it only naively comprised 0.6% of the total, meaning they upweighted it by 5x relative to everything else.

predictedNO

@jonsimon That's all to say that the factual content in these models is in large part a compressed version of Wikipedia itself, so I'm not sure how they could become more factual than their underlying training data, especially given well-known persistent problems with these models like hallucinations.

Would "able to get answers specifically from wikipedia but not from anywhere else" count as YES or NO?

@YonatanCale not sure I understand the question, the market it isn’t about any specific fact, but the reliability and scope of facts a GPT based AI system provides being comparable to that of Wikipedia.