
Resolution
At the end of 2023, I will try to gauge the majority opinion among informed experts (e.g., authors of papers at top machine learning conferences) as to, "Were the GPT-4 benchmarks contaminated?" This market resolves as YES if I think 50% or more would answer yes on a survey (with only two answer choices) and NO otherwise. Because this is a subjective resolution, I will not bet in this market. If a clearly better resolution method is available, such as an actual survey of ML researchers, I will use that but try to make sure people know in advance.
Context
OpenAI has chosen to share very little information about how GPT-4 works due to "the competitive landscape and the safety implications of large-scale models." This makes it hard to validate, reproduce, and arguably to trust their results, especially the blockbuster benchmarks like "passing a simulated bar exam with a score around the top 10% of test takers."
OpenAI says they used substring matching to check for contamination. This checks whether their training data included exact text from the benchmarks (e.g., bar exam, Codeforces, SAT, MMLU, WinoGrande), but critics (e.g., Arvind Narayanan, popular AI blogger and CS professor) still claim that "OpenAI may have tested GPT-4 on the training data."
As of March 2023, I'm very unsure how skeptical we should be of OpenAI's benchmark evaluations. It's weird to have benchmark results in ML that can't be replicated or validated, and I think overfitting to benchmarks is a huge issue in the field that only gets bigger each year. So I'm making this market to get a better prediction than I'd make on my own.
Note that there are varying degrees of "contamination" in ML. I think most researchers would agree that you don't need an exact substring match to have contamination, e.g., if I'm tested on, "Are the majority of apples red?" and I've seen the text, "Q: Are most apples red? A: Yes," then I think most would agree this is still contamination. For this market, "contamination" is left to the interpretation of the hypothetically surveyed researcher.
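To make the gap concrete, here is a rough sketch of an exact substring check applied to the apples example. This is only an illustration, not OpenAI's actual procedure: the function names, the normalization step, and the 20-character window are all my own assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def exact_substring_match(benchmark_item: str, training_doc: str, window: int = 20) -> bool:
    """Return True if any `window`-character slice of the normalized benchmark
    item appears verbatim in the normalized training document.

    Hypothetical helper: the window size is an arbitrary illustrative choice.
    """
    item = normalize(benchmark_item)
    doc = normalize(training_doc)
    if not item:
        return False
    if len(item) <= window:
        # Short items are compared whole.
        return item in doc
    return any(item[i:i + window] in doc for i in range(len(item) - window + 1))


test_question = "Are the majority of apples red?"
paraphrase_seen_in_training = "Q: Are most apples red? A: Yes"
verbatim_seen_in_training = "Quiz: Are the majority of apples red? Yes."

# The paraphrase shares no 20-character run with the test question,
# so an exact check does not flag it.
print(exact_substring_match(test_question, paraphrase_seen_in_training))
# The verbatim copy is flagged.
print(exact_substring_match(test_question, verbatim_seen_in_training))
```

Under these assumptions the check flags the verbatim copy but not the paraphrase, which is the kind of case critics have in mind when they say a clean substring check doesn't rule out contamination.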