Preface:
Please read the preface for this type of market and other similar third-party validated AI markets here.
Third-Party Validated, Predictive Markets: AI Theme
Market Description:
Break
As measured by "Break" (the standard leaderboard, not the "high-level" variant) from the AllenAI leaderboards here:
https://github.com/allenai/Break
https://allenai.github.io/Break/blogpost.html
"Significantly better," will be interpreted as meaning 30% better Normalized EM Score than the top post on this leaderboard at the time this market opened, compared to the end of the year, UTC.
https://leaderboard.allenai.org/break/submissions/public
Market Resolution Threshold:
At the time of authoring, the highest EM score is 0.4230, from T5-Large (Tomer Wolfson, Tel Aviv University).
So to qualify as "understanding the meaning of questions significantly better by the end of 2023" for the purposes of this market, there would need to be a submission scoring >= 0.5499 by the end of the year (UTC).
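For reference, the threshold is just the current top score scaled by a factor of 1.3 (a 30% relative improvement), rounded to four decimal places. A minimal sketch of that arithmetic in Python (variable names are my own, not anything from the leaderboard):

    # Resolution threshold: 30% relative improvement over the top
    # Normalized EM score at the time the market opened.
    top_score = 0.4230                      # T5-Large, top of the Break leaderboard at authoring
    threshold = round(top_score * 1.30, 4)  # 30% relative (not absolute) improvement
    print(threshold)                        # 0.5499 -> a submission must score >= this to resolve YES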
Here's what the leaderboard shows today; as it stands, this resolves NO:
New Market:
https://manifold.markets/PatrickDelaney/-will-ai-be-able-to-understand-the
I put together a new related market on AI's capability to avoid misconceptions:
https://manifold.markets/PatrickDelaney/will-ai-be-able-to-avoid-misconcept
@vluzko Ironically, that's why I bought NO: even if there's a model that beats it, there's no guarantee it's submitted.
@PatrickDelaney T5 sucks, it's like three generations behind, which means no one is running their new models on this benchmark. If you ran GPT-3.5 on this it would probably resolve YES, never mind GPT-4.
@vluzko have other models been run against other similar "meaning of words" benchmarks to support your claim?
@PatrickDelaney Do you know of any other "meaning of words" benchmarks we could check? I think the main insight here is that the last non-T5 submission to that leaderboard was almost 2 years ago, so a bet on this market might be more about the chance that researchers choose to submit their model than it is about the overall state of AI "meaning" understanding.
@DanStoyell Yes, you are absolutely right. There's a "map vs. territory" problem here, and I recognize that. So I could either: 1. change this market if we can find a better, more active leaderboard on the same topic, 2. create another market on that more active benchmark, or 3. leave it as-is, since people might still like to speculate on AllenAI, which seems to be the highest SEO-ranked leaderboard for now, and with more money potentially being focused on AI in general, people might start piling into leaderboards more now...?
I am really open to suggestions.
@DanStoyell Why do you think this market is trading at 66% as opposed to closer to NO, where you have bet at this point? Is there some special knowledge that you may not have, or are people speculating without looking as deeply into what the metric is as you are?
@PatrickDelaney I don't really feel like I have super special insight, I'm mostly going off the extrapolated rate of improvement over the last 2 years combined with the lack of submissions. 66% does feel quite high to me given that, but I'm not betting very much because it wouldn't really surprise me at all if a submission did come along that fit the criteria.
@PatrickDelaney I did briefly Google for a more active leaderboard but didn't see anything obvious. Making a market that objectively reflects the actual problem you're trying to get at is definitely very hard.
@DanStoyell I had forgotten that I put together this market as a place to bookmark more leaderboards as they come up, as well as any AI institutions that may have leaderboards I'm not aware of yet (e.g., I need to search those sites more). I know there are also one-off leaderboards out there, maintained by single small groups of researchers. Overall I am dedicated to carving out more markets to try to build a hopefully more accurate snapshot of where AI is going, beyond a lot of the speculative sci-fi nonsense that dominates the conversation right now. After having read more about Google Big Bench, it's really a collection of a ton of separate benchmarks, similar to AllenAI, but many of them don't seem to have public submissions displayed within the repo yet.