Were the GPT-4 benchmarks contaminated?
Resolved YES (Jan 1)

Resolution
At the end of 2023, I will try to gauge the majority opinion among informed experts (e.g., authors of papers at top machine learning conferences) as to "Were the GPT-4 benchmarks contaminated?" This market resolves as YES if I think 50% or more would answer yes on a survey (with only two answer choices), and NO otherwise. Because this is a subjective resolution, I will not bet in this market. If a clearly better resolution method becomes available, such as an actual survey of ML researchers, I will use it instead, but I'll try to make sure people know in advance.

Context
OpenAI has chosen to share very little information about how GPT-4 works, citing "the competitive landscape and the safety implications of large-scale models." This makes their results hard to validate or reproduce, and arguably hard to trust, especially blockbuster benchmarks like "passing a simulated bar exam with a score around the top 10% of test takers."

OpenAI says they used substring matching to check for contamination. This checks whether their training data included exact text from the benchmarks (e.g., the bar exam, Codeforces, SAT, MMLU, WinoGrande), but critics still claim that "OpenAI may have tested GPT-4 on the training data" (e.g., Arvind Narayanan, CS professor and popular AI blogger).
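For intuition, here is a minimal sketch of that style of check. It is not OpenAI's actual code; the three-sample, 50-character parameters follow my reading of the GPT-4 technical report's description, and the normalization is illustrative:

```python
import random

def substring_contaminated(eval_example: str, training_text: str,
                           n_samples: int = 3, sub_len: int = 50) -> bool:
    """Exact-substring contamination check: flag the eval example if any
    randomly sampled 50-character chunk of it appears verbatim in the
    training text. Misses paraphrases, reformatting, and translations."""
    text = " ".join(eval_example.split()).lower()    # crude normalization
    corpus = " ".join(training_text.split()).lower()
    if len(text) <= sub_len:
        return text in corpus
    for _ in range(n_samples):
        start = random.randrange(len(text) - sub_len + 1)
        if text[start:start + sub_len] in corpus:
            return True
    return False
```

The obvious weakness is that any rewording defeats the check, which is what the "varying degrees of contamination" point below is about.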

As of March 2023, I'm very unsure how skeptical we should be of OpenAI's benchmark evaluations. It's weird to have benchmark results in ML that can't be replicated or validated, and I think overfitting to benchmarks is a huge issue in the field that only gets bigger each year. So I'm making this market to get a better prediction than I'd make on my own.

Note that there are varying degrees of "contamination" in ML. I think most researchers would agree that you don't need an exact substring match to have contamination: e.g., if I'm tested on "Are the majority of apples red?" and I've seen the text "Q: Are most apples red? A: Yes," I think most would agree this is still contamination. For this market, "contamination" is left to the interpretation of the hypothetically surveyed researcher.
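To make the distinction concrete, here is a sketch of a looser check that catches the apple example above; the 0.3 threshold is arbitrary, and a truly semantic paraphrase would still require embedding-based similarity:

```python
import re

def token_set(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def fuzzy_contaminated(eval_example: str, training_doc: str,
                       threshold: float = 0.3) -> bool:
    """Token-overlap (Jaccard) check that catches light rewordings an
    exact substring match misses. The apple pair scores exactly 0.3."""
    a, b = token_set(eval_example), token_set(training_doc)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

fuzzy_contaminated("Are the majority of apples red?",
                   "Q: Are most apples red? A: Yes")  # True (3/10 = 0.3)
```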


I haven't heard as much explicit discussion of whether the GPT-4 benchmarks were contaminated as I expected, so this will probably resolve more subjectively than I made it seem. I did ask a few people at NeurIPS. I've also had many conversations about the general issue of benchmark contamination, and I think there's near-consensus (about as much as you can realistically get in this field) that benchmarks are deteriorating in usefulness, and pretty general agreement (~80% of researchers?) that at least some degree of contamination is a big part of the issue (at least ~20% of it). Most recently this has come up with MMLU and Gemini.

Normally I wouldn't post a comment this suggestive of a resolution while the market is still open, but I think there's a lot of room for debate here, so I want to open the discussion to anyone who has thoughts. The biggest reason I see for a NO resolution is that "contamination" should be taken in a very narrow sense (e.g., a nearly exact substring match, an image of the test set in the training corpus), and there haven't been many clear examples of that in data that GPT-4 was likely trained on.

I expect to resolve this in the first week of January.

bought Ṁ130 of YES

Extracting training data from ChatGPT (link) uses an interesting method that I and many others tried months ago but didn't realize was straight-up a way to check for memorization of training data (paper). I haven't read the full paper yet, but it would be useful to test whether any of the benchmark tests were memorized, haha
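For reference, a common memorization probe along these lines (not necessarily the linked paper's exact method): prompt the model with a prefix of a benchmark item and check whether it reproduces the held-out remainder verbatim. `model_complete` is a hypothetical stand-in for whatever completion API is under test:

```python
from typing import Callable

def verbatim_memorized(item: str,
                       model_complete: Callable[[str], str],
                       prefix_frac: float = 0.5,
                       min_match: int = 40) -> bool:
    """Prompt with the first half of a benchmark item and flag it as
    memorized if the model's continuation reproduces the held-out
    second half verbatim (up to the first `min_match` characters).
    A rough heuristic, not a definitive test."""
    split = int(len(item) * prefix_frac)
    prefix, target = item[:split], item[split:]
    completion = model_complete(prefix)
    return completion.strip()[:min_match] == target.strip()[:min_match]
```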

bought Ṁ80 of YES

If there was contamination in coding benchmarks, how would this resolve? For example, there's plenty of talk about LeetCode problems it did well on while failing on AtCoder.

predicted YES

@firstuserhere "it solves 10/10 problems from pre-2021 and 0/10 of the most recent problems (which it has never seen before) is very suspicious"

and anecdotally, when I try to twist the questions from LeetCode, it just answers as if I had asked the original LC question, as if it has seen it.

@firstuserhere it depends on how "informed experts (e.g., authors of papers at top machine learning conferences)" would view that. My current guess would be that findings like "10/10 pre-2021 problems and 0/10 recent problems" are moderate evidence of contamination to most informed experts.
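A sketch of the temporal-split probe being described, assuming problems published after the training cutoff were unseen; a large gap is suggestive rather than conclusive, since newer problems may also be harder:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    statement: str
    year: int  # year the problem was published

def temporal_solve_rates(problems: List[Problem],
                         solves: Callable[[Problem], bool],
                         cutoff_year: int = 2021) -> Tuple[float, float]:
    """Compare solve rates on problems from before vs. after the model's
    likely training cutoff; 10/10 vs. 0/10 is the pattern quoted above."""
    old = [p for p in problems if p.year < cutoff_year]
    new = [p for p in problems if p.year >= cutoff_year]
    rate = lambda ps: sum(map(solves, ps)) / len(ps) if ps else float("nan")
    return rate(old), rate(new)
```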

The market has somewhat stabilized around 70%. That's interesting! I would have guessed lower.

Coding was; judging by its poetry and general knowledge, the verbal benchmarks are vastly understated (may not be well-prompted)