Resolved NO on Jan 10

Preface:

Please read the preface for this type of market and other similar third-party validated AI markets here.

Third-Party Validated, Predictive Markets: AI Theme

Market Description:

Break

As measured by the "Break" benchmark (non-high-level) on the Allen AI leaderboards here:

https://github.com/allenai/Break

https://allenai.github.io/Break/blogpost.html

"Significantly better" is interpreted as a Normalized EM Score 30% higher than that of the top entry on this leaderboard when this market opened, measured at the end of the year, UTC.

https://leaderboard.allenai.org/break/submissions/public

Market Resolution Threshold:

At the time of authoring, the highest EM Score is 0.4230, from T5-Large (Tomer Wolfson, Tel Aviv University).

So, for the purposes of this market, to qualify as "Understanding the 'Meaning' of Questions Significantly Better by the End of 2023," there would need to be a submission scoring >= 0.5499 (that is, 0.4230 × 1.3) by the end of the year, UTC.
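The threshold arithmetic can be checked with a small sketch. The names `BASELINE_EM`, `threshold`, and `resolves_yes` are hypothetical, and rounding to four decimals is an assumption chosen to match the precision the leaderboard displays. The sketch does not fetch any live leaderboard data; you pass in the best score yourself.

```python
# Hypothetical sketch of this market's resolution rule: a submission must
# beat the top Normalized EM score at market open by at least 30%.
BASELINE_EM = 0.4230          # T5-Large, top entry when the market opened
REQUIRED_IMPROVEMENT = 0.30   # "significantly better" = 30% higher


def threshold(baseline: float = BASELINE_EM,
              improvement: float = REQUIRED_IMPROVEMENT) -> float:
    """Minimum qualifying score, rounded to 4 places (assumed display precision)."""
    return round(baseline * (1 + improvement), 4)


def resolves_yes(best_score_at_year_end: float) -> bool:
    """True if the best year-end score meets or exceeds the threshold."""
    return best_score_at_year_end >= threshold()


print(threshold())  # 0.5499
```

With no new submission, the best score stays at 0.4230, so `resolves_yes(0.4230)` returns `False`, which matches the NO resolution.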


Here's what I have from the leaderboard today; this resolves NO:

New Market:

https://manifold.markets/PatrickDelaney/-will-ai-be-able-to-understand-the

sold Ṁ17 of YES

I wanted to try out the Manifold loot box feature, which caused me to bet in my own market here. This was not intentional, and I would not recommend the loot box feature for this reason.

bought Ṁ10 of YES

If T5-Large is the current SOTA, then the top score is artificially low because no one is actually trying.

@vluzko Can you expand on that?

predicted NO

@vluzko Ironically, that's why I bought NO: even if there's a model that beats it, there's no guarantee it gets submitted.

predicted YES

@PatrickDelaney T5 sucks; it's like three generations behind, which means no one is running their new models on this benchmark. If you ran got 3.5 on this, it would probably resolve, never mind 4.

@vluzko Have other models been run against similar "meaning of words" benchmarks to support your claim?

@vluzko What is got 3.5? You mean GPT-3.5? Autocorrect?

predicted NO

@PatrickDelaney Do you know of any other "meaning of words" benchmarks we could check? I think the main insight here is that the last non-T5 submission to that leaderboard was almost 2 years ago, so a bet on this market might be more about the chance that researchers choose to submit their model than it is about the overall state of AI "meaning" understanding.

predicted YES

@PatrickDelaney Yeah, GPT-3.5. Stupid autocorrect.

@DanStoyell Yes, you are absolutely right. There's a "map vs. territory" problem here; I recognize that. So I could either (1) change this market if we can find a better, more active leaderboard on the same topic, (2) create another market on that more active benchmark, or (3) leave it as is, since people might still like to speculate on AllenAI: it seems to be the highest SEO-ranked leaderboard for now, and with potentially more money being focused on AI in general, people might start piling into leaderboards more now...?

I am really open to suggestions.

@DanStoyell Why do you think this market is trading at 66% rather than closer to NO, where you have bet? Is there some special knowledge you may not have, or are people speculating without looking as deeply into the metric as you have?

predicted NO

@PatrickDelaney I don't really feel like I have super special insight, I'm mostly going off the extrapolated rate of improvement over the last 2 years combined with the lack of submissions. 66% does feel quite high to me given that, but I'm not betting very much because it wouldn't really surprise me at all if a submission did come along that fit the criteria.

predicted NO

@PatrickDelaney I did briefly Google for a more active leaderboard but didn't see anything obvious. Making a market that objectively reflects the actual problem you're trying to get at is definitely very hard.

@DanStoyell Yeah if you can think of anything else please let me know.

@DanStoyell I had forgotten that I put together this market as a place to bookmark more leaderboards as they come up, as well as any AI institutions that may have leaderboards I'm not aware of yet (e.g., I need to search those sites more). I know there are also one-off leaderboards out there, which I have seen, maintained by a single small group of researchers. Overall, I am dedicated to carving out more markets to try to build a hopefully more accurate snapshot of where AI is going, beyond a lot of the speculative sci-fi nonsense that dominates the conversation right now. Having read more about Google Big Bench, I see it's really a collection of many separate benchmarks, similar to AllenAI, but many of them don't seem to have public submissions displayed within the repo yet.

@DanStoyell So over time, as I comb through these, I might find a better answer to our question.

Another relevant market:

See my other relevant market:

@PatrickDelaney Another:
