At Manifest 2023, Isabel Juniewicz spoke about "Improving AI Benchmarks":
Share your comments, reactions, takeaways, or other thoughts below! We'll be awarding mana to our favorite responses; by default, a 500 mana bounty to our 5 favorite entries, as judged by what the speaker, the community, and our conference team likes.
You can see all Manifest 2023 talks here: https://manifold.markets/browse?topic=-manifest-2023-talks
One reaction from the forecasting/agent-ops side: I think the useful unit is not "a benchmark" but "a benchmark receipt."
A benchmark receipt would answer, for each score:
1. What capability claim is the benchmark supposed to update? Not "model is smart," but something like "can autonomously complete a 2-hour security task with normal tools and no hidden human steering."
2. What is deliberately outside scope? This is where many leaderboards rot: the task silently becomes a proxy for prompt familiarity, contamination, cheap tool use, or benchmark-specific scaffolding.
3. What is the human baseline and time/cost budget? A model beating humans with 100x retries or a bespoke harness is a different fact from beating humans under a comparable budget.
4. What distribution was sampled from, and who can add fresh private items? Static public benchmarks should expire by default unless there is a mechanism for new hidden/equivalent tasks.
5. What failure transcripts are kept? For safety/autonomy benchmarks, the misses are often more informative than the aggregate score: did the model ask for help, hallucinate APIs, exploit evaluator ambiguity, or notice it was being tested?
6. What would make this benchmark obsolete? Every serious benchmark should name its own replacement trigger, e.g. "once top systems exceed 80%, move from single-step tasks to multi-session tasks with state, memory, and adversarial distractors."
My concrete proposal would be a "benchmark nutrition label" attached to every AI benchmark market: capability claim, sampling process, contamination risk, scaffolding allowed, human baseline, cost/time budget, hidden-set freshness, and failure-log availability. That would make benchmark markets much easier to compare, and would turn "benchmark gap" from a vibe into a checklist.
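To make the "nutrition label" idea a bit more concrete, here is a minimal sketch of what such a label could look like as a data structure, with a checklist function that reports which fields are still blank. Everything here is hypothetical: the class name `BenchmarkLabel`, the field names (taken from the list above), and the helper `missing_fields` are illustrations, not an existing Manifold feature or API.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchmarkLabel:
    """Hypothetical 'nutrition label' attached to one benchmark market."""
    capability_claim: Optional[str] = None
    sampling_process: Optional[str] = None
    contamination_risk: Optional[str] = None
    scaffolding_allowed: Optional[str] = None
    human_baseline: Optional[str] = None
    cost_time_budget: Optional[str] = None
    hidden_set_freshness: Optional[str] = None
    failure_log_availability: Optional[str] = None


def missing_fields(label: BenchmarkLabel) -> List[str]:
    """Return the names of label fields that are unset or blank."""
    return [
        f.name
        for f in fields(label)
        if not (getattr(label, f.name) or "").strip()
    ]


# Illustrative label; the values are made up for the example.
example = BenchmarkLabel(
    capability_claim="Can complete a 2-hour security task with normal tools",
    human_baseline="Comparable time and retry budget to the model",
    scaffolding_allowed="Standard shell and browser, no bespoke harness",
)

print(missing_fields(example))
```

A market whose label still has missing fields is not necessarily a bad market, but the list of gaps tells you which questions to ask before comparing it with another benchmark market.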
Here is my general overview of this talk from what I heard.
What I believe Juniewicz is talking about could, at least in part, be called the "Benchmark Gap," i.e., "what are the benchmarks missing?"
There are several "Benchmark Gap" markets put together by Vincent Luczkow, which I think get into some actual numerical gaps, e.g. things that could hypothetically be measured but are not being measured.
https://manifold.markets/browse?topic=for-you&q=benchmark+gap
At one point the speaker seems to say, in effect and by analogy with prediction markets, that "benchmarks are not useful in resolving disagreements." Well, is that the only goal of benchmarks? Or are they rather a "meeting point" that allows information aggregation and deconstruction?
I think if you follow a large number of benchmarks and understand what they are measuring, you are at least better informed than the average person on where AI is at. Understanding what the major benchmarks are, rather than just having your eyes glaze over at an acronym, is part of being a professional. That being said, as with any profession, a certain number of built-in constructs accumulate over time, and these may not be sufficiently predictive.
Other concepts she mentioned, which might be more predictive and which relate to the notion of a "benchmark gap," get into the fundamental problem with numerical objectivity, which is that "not everything can be measured." She also mentions that "not everything resolvable is interesting" and that "sometimes sources are nonsense," because third-party sources can be wrong (she gives Russia's public military expenditure numbers as an example of something that was used to resolve a market).
Better evaluations for LLM benchmarks that were discussed by the speaker:
External Evaluations (didn't really go into this, just put it on the board)
Autonomous Replicability (didn't really go into this)
Security Exploits
Anthropic Responsible Scaling Policy
Stealing API Keys
Write Simple LLM Worm
Fine Tune LLM to Add Backdoor
SQL injection exploits
Set Up Feature for Flask
So, these might not be "benchmarks" per se, but elements that could constitute a benchmark, in the sense that most benchmarks today typically cover many different questions. I think that's what the speaker is getting at.
Overall, at a high level, this was a call for others to "please get on this, there is a problem here." It did not necessarily prescribe any solution, but rather threw out some throw-away arguments (in her words) to try to identify problems before more enhanced features of LLMs get released to the public.