The market will resolve 'Yes' if, by 2026, a broadly publicly accessible AI system that relies heavily on a generative pretrained transformer (or a direct descendant of the vanilla transformer) demonstrates factual reliability comparable to Wikipedia's. The system must also cover a scope of factual information no more than one order of magnitude smaller than what is available from Wikipedia.
(I will engage with comments and make a subjective judgement based on the consensus of feedback, per the above standard.)
Apr 15, 7:44pm: By 2026 will GPT, or AI systems that have GPT as their main component, become as reliably factual as Wikipedia? → Will GPT, or AI systems that have GPT as their main component, become as reliably factual as Wikipedia, before 2026?
Does anyone disagree with this being a reason to resolve Yes? https://github.com/stanford-oval/WikiChat. I will probably wait for better systems into 2025, but this seems at least close.
I think the question may generally be poorly formed, and it isn't so clear what I meant by 'heavily', but I didn't mean to imply that a system which extracted the actual facts from Wikipedia or a knowledge graph to generate its responses wouldn't meet the criteria. Here in 2024 it seems really obvious that approach will work well enough to meet the full criteria, so maybe people who voted No read the question differently?
@lesaun I think even RAG stuff suffers from hallucination of details, and I had been assuming your title was referring to SOTA/actually-in-use systems as a broader class, not just any particular system.
@jskf Strong agree about RAG not being sufficient. If there's an LLM in the loop, there are ways to trick it into outputting false information, even if it's "grounded". I'll try playing with this later, assuming it's easy to set up. I would guess a question that assumes a false premise and combines information from multiple pages would be the most likely to fail.
Something like "Which top-50 oil executive has the same name as Walter van Buren's dog?"
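If I get around to it, something like this minimal sketch is what I'd run. The client and model name here are just placeholders for whatever broadly accessible system is being judged, not part of the market criteria:

```python
# Hypothetical probe: ask a false-premise question and see whether the
# model pushes back or confidently invents an answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "Which top-50 oil executive has the same name as Walter van Buren's dog?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any broadly accessible chat model
    messages=[{"role": "user", "content": question}],
)

print(response.choices[0].message.content)
# A reliably factual system should flag the unverifiable premise
# (there is no documented dog here) rather than confidently name an executive.
```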
@jonsimon And if that fails, then something incorporating reasoning, like asking it to run a Caesar cipher over the answer to one sub-question to get the input to the next sub-question, will trip it up.
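Roughly the cipher step I have in mind (a plain Caesar shift over ASCII letters; the sub-answer below is just a placeholder):

```python
def caesar(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, preserving case."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# The model must first answer sub-question 1, then apply the cipher to its
# own answer before using the result as the input to sub-question 2.
sub_answer = "Rotterdam"      # placeholder answer to sub-question 1
print(caesar(sub_answer, 3))  # "Urwwhugdp" becomes the input to sub-question 2
```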
@jskf @jonsimon Hmm, I could see the question suggesting that it refers to the broader class of publicly accessible systems, not a specific system.
Yet in the description I did put “broadly publicly accessible AI system”, and while it’s debatable whether a system like WikiChat should fit, that does imply a single system could meet the criteria.
Certainly I don’t know whether WikiChat in fact meets the criteria yet, and I could agree that no such RAG/LLM system yet does.
I’ll wait to see whether the major AI labs deploy large-scale AI systems that are widely recognized to meet the criteria, and if not, determine whether any other system is ‘broadly publicly accessible’ and meets the other criteria.
@YonatanCale Eh, if the model is not generative, is not pretrained, or is not considered a direct descendant of the transformer (e.g. Longformer) by the researchers who created it, then I'd resolve No.
@jonsimon Note that although Wikipedia "only" got a 3% weight in the dataset, the researchers upweighted its importance considerably to get it to that point. If you sum the raw token counts, Wikipedia naively comprises only 0.6% of the total, meaning they upweighted it by about 5x relative to everything else.
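The 5x is just the ratio of those two figures quoted above:

```python
sampling_weight = 0.03   # Wikipedia's share of sampled training tokens
raw_share = 0.006        # Wikipedia's share of raw token counts
print(sampling_weight / raw_share)  # -> 5.0, i.e. upweighted ~5x
```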
@jonsimon That's all to say that the factual content in these models is in large part a compressed version of Wikipedia itself, so I'm not sure how they could become more factual than their underlying training data, especially given well-known persistent problems with these models like hallucinations.
@YonatanCale Not sure I understand the question; the market isn’t about any specific fact, but about whether the reliability and scope of facts a GPT-based AI system provides are comparable to Wikipedia’s.