🐕 Will Any AI Effectively Achieve Higher Than Human Level at Answering Multiple Choice, Grounded Situations?


resolved Jan 8


Preface:

Please read the preface for this type of market and other similar third-party validated AI markets here.

Third-Party Validated, Predictive Markets: AI Theme

Market Description

HellaSwag

HellaSwag is a dataset for studying grounded commonsense inference. It consists of 70k multiple choice questions about grounded situations. Each question comes from one of two domains, ActivityNet or WikiHow, with four answer choices about what might happen next in the scene. The correct answer is the (real) sentence for the next event; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans.

Example HellaSwag Question

A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She

a) rinses the bucket off with soap and blow dries the dog's head.
b) uses a hose to keep it from getting soapy.
c) gets the dog wet, then it runs away again.
d) gets into the bath tub with the dog.

Answer: C.

Two Leaderboards:

https://paperswithcode.com/sota/sentence-completion-on-hellaswag

https://leaderboard.allenai.org/hellaswag/submissions/public

Papers with Code - HellaSwag Benchmark (Sentence Completion)

The current state-of-the-art on HellaSwag is GPT-4 (few-shot, k=10). See a full comparison of 31 papers with code.

Human performance is measured at 0.9560

Resolution Criteria

For the purposes of this question, we define Superintelligence as "achieving 99% accuracy on the test in question."
Will any entry on either of the two leaderboards linked above reach 99% accuracy? If so, resolves YES; otherwise NO.
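
As a rough illustration of how this criterion could be checked, here is a minimal Python sketch. It assumes leaderboard accuracies have already been collected by hand as fractions between 0 and 1, and it assumes "99% accuracy" means at least 0.99. The function name and the sample scores are hypothetical and are not taken from either leaderboard.

```python
from typing import Iterable

# Threshold from the resolution criteria (assumed to mean >= 0.99).
THRESHOLD = 0.99
# Human performance as quoted in the market description.
HUMAN_BASELINE = 0.9560


def resolves_yes(accuracies: Iterable[float], threshold: float = THRESHOLD) -> bool:
    """Return True if any leaderboard entry meets or exceeds the threshold."""
    return any(acc >= threshold for acc in accuracies)


# Hypothetical accuracies, for illustration only (not real leaderboard data).
example_scores = [0.9400, 0.9560, 0.9700]

print("Resolves YES" if resolves_yes(example_scores) else "Resolves NO")
print("Beats human baseline:", any(s > HUMAN_BASELINE for s in example_scores))
```

Note that an entry can beat the human baseline of 0.9560 and still fall short of the 0.99 threshold, which is the distinction the title change was meant to address.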

20230727 - Changed title, "Superintelligence" to "Higher than Human Level"





Highest metric I could find

2024 version of this market: https://manifold.markets/PatrickDelaney/-will-any-ai-effectively-achieve-hi-62a7d1e40a77

This feels more like a question about the noise ceiling on HellaSwag 😀 Also, you might have to specify what happens if the test set leaks on the internet.

I think you should change the title back to superintelligence; "higher than human level" is easily interpreted as just better than 0.9560

@RobertCousineau There seem to be different interpretations of what the term "superintelligence" may mean. My original intent in using it was that it's a fairly recognized term, and I wanted to attract people to this market under a recognizable term... a meme, if you will. However, in the interest of not being overly market-baiting, I changed it to "higher than human level performance" to try to be more accurate.

An aside: I won't title these markets directly after the benchmark they measure, because I have found from previous activity on Manifold that if you make the titles too boring, no one bets on them. They have to be human readable and approachable.

I had an objection from "decision theorist and widely recognized founder of A.I. Eliezer Yudkowsky, according to Time Magazine" 😂 to using the term Superintelligence in another market, which made me think more critically about the use of this term.

If you look at Wikipedia's current article on superintelligence, it cites a quote from Nick Bostrom:

any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest

So I guess by that definition, and taking a numerical objectivism standpoint (meaning we assume that benchmarks are the best way to describe reality, even if there is a benchmark gap), one could argue that surpassing human performance on the amalgamation of all currently active benchmarks is a way to define a superintelligence. Yudkowsky's comment on my other market, which implies that "humans can't possibly create a test that measures superintelligence," reads as quasi-religious to me, so I'm just not going to entertain that line of thinking for the purposes of betting markets. I think it's fairer to create third-party validated markets and to be as non-subjective as possible wherever one can.