Did OpenAI get 98% on RAG?
Resolved YES on Nov 27

This market asks whether OpenAI's RAG results really reached 98% accuracy, as claimed in the tweet. I don't currently have more information about this, but the figure apparently comes from a talk.

Top level comment with relevant OpenAI Dev Day talk:
https://www.youtube.com/watch?v=ahnGLM-RC1Y

Also including the rest of the Dev Day talks: https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213

@e_gle Thanks!

@e_gle I sold for a loss after watching this. They demonstrated the accuracy with a real (private) customer.

@e_gle I sent you a tip for sharing the video. The talk provides the missing context, thanks.

@firstuserhere didn't realize that was a thing. Thanks!

predicted NO

Since no one else has actually explored this question, I did some additional digging. RAG is not a benchmark. https://twitter.com/mayowaoshin/status/1721837978840895843

OpenAI's talk displays a roadmap to a 2% hallucination rate (98% accuracy) using Retrieval-Augmented Generation with a number of additional techniques layered on top. The test data was not shared, so we do not know whether this was actually achieved in a real-world setting or just on some small, specific data set.

I think the way you have asked this question makes the market unable to resolve. "OpenAI's results for RAG... reach 98%" is not a real question.

"Did OpenAI display a roadmap to 2% hallucination rate by using RAG techniques?" That question probably resolves as Yes.

"Did OpenAI publicly release an LLM that achieves 98% accuracy in real-world data sets?" That resolves to No.

bought Ṁ50 NO from 90% to 81%

@e_gle Well, of course we don't know what they tested it on, which is why it's speculation. If the figure is correct, I think it's reasonable to assume that they didn't just overfit a dataset or use a contaminated test set and call it 98% accuracy. It's on that assumption that we can speculate, but of course, as jack says below, it's meaningless without context.

P.S. Can you link to the talk you're referring to?

predicted YES

I'm interpreting the question narrowly as "Did they really get the claimed results, 98% on whatever RAG benchmark that they talked about", based on the previous clarifications. I think other interpretations don't really make sense with the question.

predicted YES

RAG is not a benchmark, it's a technique. The slide says "RAG success story". This slide is just presenting the performance of some specific techniques on some specific problem.

@jack Yeah, given how little information is publicly available, that's what makes the most sense to ask.

predicted YES

To be clear, I think this question is largely pointless. I'm very confident that the slide is likely as accurate as any other slide of its nature, and that the tweet's claim that this has anything to do with the OpenAI/Altman drama is not accurate.

@jack That's what I'm leaning towards as well. The slide's validity is what the market was asking about, and it seems likely accurate, but I also agree that this is a pointless question now.

I'm very confident that they wouldn't have presented a made-up graph. The benchmark could be almost anything; it's meaningless without context. It would be nice if someone could find the context from the talk.

This was a talk, did you watch it?

I did. It’s not some universal thing, this market makes no sense.

@SneakySly no, I haven't. Thanks for the info, will see

@SneakySly do you have a link to this talk?

predicted YES

@SneakySly Yep! I think the only way this resolves NO is if the performance claim of 98% was just made up, or fraud, or wrong due to an innocent mistake. And I see no reason to expect that.

I thought RAG was a technique (Retrieval-Augmented Generation), so it's confusing to ask whether they got 98% on RAG. It's not a benchmark test like MMLU or something, is it?

Not an expert so anyone feel free to clarify.

@KevinCornea I'm assuming they must've had something

https://arxiv.org/abs/2309.01431

@firstuserhere The tweet makes a pair of claims: the performance graph, and the claim that OpenAI got spooked as a result. Your title implies the question is only about the former, but the description implies it's about the whole tweet. Which is it?

@firstuserhere Reminder on the above question. These are completely different questions depending on whether I go by the title or the tweet.

@jack thanks, fixed