At Manifest 2023, Isabel Juniewicz spoke about "Improving AI Benchmarks":
Share your comments, reactions, takeaways, or other thoughts below! We'll be awarding mana to our favorite responses; by default, a 500 mana bounty to our 5 favorite entries, as judged by what the speaker, the community, and our conference team likes.
You can see all Manifest 2023 talks here: https://manifold.markets/browse?topic=-manifest-2023-talks
Here is my general overview of this talk from what I heard.
What I believe Juniewicz is talking about, in general is could in part be referred to as, "Benchmark Gap." E.g., "what are the benchmarks missing?"
There are several, "Benchmark Gap," markets that have been put together by Vincent Luczkow, which I think gets into some actual numerical gaps, e.g. things that could hypothetically be measured but are not being measured.
https://manifold.markets/browse?topic=for-you&q=benchmark+gap
At one point the speaker seems to say, effectively, similar to prediction markets, "Benchmarks are not useful in resolving disagreements," ... well, is that the only goal of benchmarks? Or is it rather, a, "meeting point," which allows information aggregation and deconstruction?
I think if you follow a large number of benchmarks and understand what they are measuring, you are at least better informed than the average person on where AI is at. Understanding what the major benchmarks are rather than just having your eyes glaze over at an acronym is part of being a professional. That being said, as with any profession, there is a certain amount of built-in constructs that get made over time which may not be sufficiently predictive.
Other concepts she mentioned which might be more predictive relating to the notion of a, "benchmark gap," get into the fundamental problem with numerical objectivity, which is that, "not everything can be measured." She also mentions, "not everything resolvable is interesting," and, "sometimes sources are nonsense," because third party sources can be wrong (she gives Russia's military expenditure public numbers as an example as something that was used to resolve a market).
Better evaluations for LLM benchmarks that were discussed by the speaker:
External Evaluations (didn't really go into this, just put it on the board)
Autonomous Replicability (didn't really go into this)
Security Exploits
Anthropic Responsible Scaling Policy
Stealing API Keys
Write Simple LLM Worm
Fine Tune LLM to Add Backdoor
SQL injection exploits
Set Up Feature for Flask
So, these might not be, "benchmarks," per se, but elements that could constitute a benchmark, in the sense that most benchmarks today are typically covering many different questions. I think that's what the speaker is getting at.
Overall at a high level this was a call for others to, "please get on this, there is a problem here," but not necessarily proscribing any solution, but rather throwing out some throw-away arguments (in her words) to attempt to identify problems before more enhanced features of LLM's get released to the public.