"Improving AI Benchmarks" by Isabel Juniewicz - Discussion

At Manifest 2023, Isabel Juniewicz spoke about "Improving AI Benchmarks":

Share your comments, reactions, takeaways, or other thoughts below! We'll be awarding mana to our favorite responses; by default, a 500 mana bounty to each of our 5 favorite entries, as judged by what the speaker, the community, and our conference team like.

You can see all Manifest 2023 talks here: https://manifold.markets/browse?topic=-manifest-2023-talks

Here is my general overview of this talk from what I heard.

What I believe Juniewicz is talking about could, at least in part, be called the "benchmark gap," i.e., "what are the benchmarks missing?"

There are several "Benchmark Gap" markets put together by Vincent Luczkow, which I think get into some actual numerical gaps, i.e., things that could hypothetically be measured but are not being measured.


At one point the speaker seems to say, effectively, that similar to prediction markets, "Benchmarks are not useful in resolving disagreements." Well, is that the only goal of benchmarks? Or are they rather a "meeting point" that allows information aggregation and deconstruction?

I think if you follow a large number of benchmarks and understand what they are measuring, you are at least better informed than the average person on where AI is at. Understanding what the major benchmarks are, rather than just having your eyes glaze over at an acronym, is part of being a professional. That being said, as with any profession, a certain number of built-in constructs accumulate over time that may not be sufficiently predictive.

Other concepts she mentioned, which might be more predictive and relate to the notion of a "benchmark gap," get into the fundamental problem with numerical objectivity, which is that "not everything can be measured." She also mentions that "not everything resolvable is interesting" and that "sometimes sources are nonsense," because third-party sources can be wrong (she gives Russia's public military expenditure numbers as an example of something that was used to resolve a market).

Better evaluations for LLM benchmarks that were discussed by the speaker:

  • External Evaluations (didn't really go into this, just put it on the board)

  • Autonomous Replicability (didn't really go into this)

  • Security Exploits

    • Anthropic Responsible Scaling Policy

      • Stealing API Keys

      • Write Simple LLM Worm

      • Fine Tune LLM to Add Backdoor

      • SQL injection exploits

      • Set Up Feature for Flask

So, these might not be "benchmarks" per se, but elements that could constitute a benchmark, in the sense that most benchmarks today typically cover many different questions. I think that's what the speaker is getting at.
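
To make that concrete, here is a rough sketch of how individual tasks like the ones listed above could be rolled up into a single benchmark number. This is only my illustration, not anything the speaker showed: the `TaskResult` type, the `benchmark_score` function, and the pass/fail values are all made up, and a real evaluation would need much more careful grading than a yes/no per task.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str     # task label, borrowed from the list in this comment
    passed: bool  # hypothetical pass/fail outcome, NOT real data


def benchmark_score(results: list[TaskResult]) -> float:
    """Return the fraction of tasks passed (one simple aggregation choice)."""
    if not results:
        raise ValueError("no task results to aggregate")
    return sum(r.passed for r in results) / len(results)


# Made-up outcomes, purely for illustration.
results = [
    TaskResult("Stealing API keys", False),
    TaskResult("Write simple LLM worm", False),
    TaskResult("Fine-tune LLM to add backdoor", False),
    TaskResult("SQL injection exploits", True),
    TaskResult("Set up feature for Flask", True),
]
score = benchmark_score(results)
print(f"{score:.0%} of tasks passed")
```

A plain pass rate treats every task as equally important, which is probably wrong for security tasks; weighting tasks by severity would be one obvious variation.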

Overall, at a high level, this was a call for others to "please get on this, there is a problem here." It did not necessarily prescribe any solution, but rather threw out some throw-away arguments (in her words) in an attempt to identify problems before more enhanced features of LLMs get released to the public.

Why do only five of the talks have discussion threads? Are you going to add discussions for the rest?