Did OpenAI get 98% on RAG?

Resolved YES on Nov 27

This market asks whether OpenAI's RAG results really reached the 98% accuracy claimed in the tweet. I don't currently have more information about this, but apparently it comes from a talk.

Top level comment with relevant OpenAI Dev Day talk:
https://www.youtube.com/watch?v=ahnGLM-RC1Y

Also including the rest of the Dev Day talks: https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213

@e_gle Thanks!

@e_gle I sold at a loss after watching this. They demonstrated the accuracy with a real (private) customer.

@e_gle I sent you a tip for sharing the video; the talk provides the missing context. Thanks!

@firstuserhere I didn't realize that was a thing. Thanks!

(predicted NO)

Since no one else has actually explored this question, I did some additional digging. RAG is not a benchmark. https://twitter.com/mayowaoshin/status/1721837978840895843

OpenAI's talk presents a roadmap to a 2% hallucination rate (98% accuracy) using Retrieval-Augmented Generation with several additional techniques layered on top. The test data was not shared, so we do not know whether this was achieved in a real-world setting or just on some small, specific dataset.

I think the way you have asked this question makes the market unable to resolve. "OpenAI's results for RAG... reach 98%" is not a real question.

"Did OpenAI display a roadmap to a 2% hallucination rate using RAG techniques?" That question probably resolves YES.

"Did OpenAI publicly release an LLM that achieves 98% accuracy on real-world datasets?" That resolves NO.
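To make the "2% hallucination rate (98% accuracy)" arithmetic concrete, here is a minimal sketch. It assumes (this is not confirmed by the talk) that the hallucination rate is simply the fraction of answers graded as wrong on some evaluation set, so accuracy is its complement. The function names and both evaluation sets are hypothetical; the point is only that the headline number depends entirely on which questions are in the set.

```python
def accuracy(graded: list[bool]) -> float:
    """Fraction of answers graded correct (True) in an evaluation set."""
    if not graded:
        raise ValueError("evaluation set is empty")
    return sum(graded) / len(graded)


def hallucination_rate(graded: list[bool]) -> float:
    """Fraction of answers graded as hallucinated (False); the complement of accuracy."""
    return 1.0 - accuracy(graded)


# Hypothetical eval set: 49 correct answers and 1 hallucinated answer.
small_set = [True] * 49 + [False]
print(f"accuracy={accuracy(small_set):.0%}, hallucination rate={hallucination_rate(small_set):.0%}")

# The same pipeline on a different (equally hypothetical) question set can score very differently.
harder_set = [True] * 40 + [False] * 10
print(f"accuracy={accuracy(harder_set):.0%}, hallucination rate={hallucination_rate(harder_set):.0%}")
```

Without the test data, a figure like 98% can't be compared across setups, which is the crux of the disagreement in this thread.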

@e_gle Well, of course we don't know what they tested it on, which is why it's speculation. If the slide is correct, I think it's reasonable to assume they didn't just overfit a dataset or use a contaminated test set and call it 98% accuracy. We can speculate based on that assumption, but of course, as jack says below, it's meaningless without context.

P.S. Can you link to the talk you're referring to?

(predicted YES)

I'm interpreting the question narrowly as "Did they really get the claimed results, 98% on whatever RAG benchmark that they talked about", based on the previous clarifications. I think other interpretations don't really make sense with the question.

(predicted YES)

RAG is not a benchmark, it's a technique. The slide says "RAG success story". This slide is just presenting the performance of some specific techniques on some specific problem.
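To illustrate why RAG is a technique rather than a benchmark, here is a toy sketch of the general retrieve-then-generate pattern. It is not OpenAI's pipeline: the word-overlap retrieval stands in for the embedding search a real system would typically use, the stopword list and all function names are hypothetical, and the final step (sending the prompt to an LLM) is left out rather than invented.

```python
import re
from collections import Counter

# Toy stopword list so that words like "the" and "is" don't count as matches.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "on", "to", "and", "how"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k docs sharing the most words with the query (a stand-in for embedding search)."""
    query_counts = Counter(tokenize(query))

    def overlap(doc: str) -> int:
        # Counter & Counter keeps the minimum count of each shared word.
        return sum((query_counts & Counter(tokenize(doc))).values())

    return sorted(docs, key=overlap, reverse=True)[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Augment the question with retrieved passages; this prompt would then go to an LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


docs = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
question = "What is the refund policy?"
print(build_prompt(question, retrieve(question, docs, k=1)))
```

Any accuracy number for a pipeline like this comes from grading its answers on some chosen set of questions, which is why "98% on RAG" only means something once you know what was tested.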

@jack Yeah, given the limited amount of information there is publicly available, that's what makes the most sense to ask

(predicted YES)

To be clear, I think this question is largely pointless. I'm very confident that the slide is about as accurate as any other slide of its kind, and that the tweet's claim that this has anything to do with the OpenAI/Altman drama is not accurate.

@jack That's what I'm leaning towards as well. The slide's validity is what the market was asking about, and it seems likely accurate, but I also agree that this is now a pointless question.

I'm very confident that they wouldn't have presented a made-up graph. The benchmark could be almost anything; it's meaningless without context. It would be nice if someone found the context from the talk.

This was a talk, did you watch it?

I did. It’s not some universal thing; this market makes no sense.

@SneakySly No, I haven't. Thanks for the info, I'll take a look.

@SneakySly do you have a link to this talk?

(predicted YES)

@SneakySly Yep! I think the only way this resolves NO is if the performance claim of 98% was just made up, or fraud, or wrong due to an innocent mistake. And I see no reason to expect that.

I thought RAG (Retrieval-Augmented Generation) was a technique, so isn't it confusing to ask whether they got 98% on RAG? It's not a benchmark like MMLU, is it?

Not an expert so anyone feel free to clarify.

@KevinCornea I'm assuming they must've had something

https://arxiv.org/abs/2309.01431

Tweet: https://twitter.com/IntuitMachine/status/1726661130322350212

@firstuserhere The tweet makes two claims: the performance graph, and the claim that OpenAI got spooked as a result. Your title implies the question is only about the former, but the description implies it's about the whole tweet. Which is it?