๐Ÿ• Will A.I. Be Significantly Better, "Able to Track Changes in State," By the End of 2023?
Resolved NO on Jan 9

Warning! This market may already be doomed, as the threshold picked is higher than human performance. I don't know that for sure, though; I'm just letting you know before you bet. Here's one that's set at slightly below human performance:

Preface / Inspiration:

  • There are a lot of questions on Manifold about whether we'll see sentience, general A.I., and other nonsense and faith-based questions, which rely on the market maker's interpretation and often close at some far-distant point in the future when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions that are referenced by a third party.

Market Description

ProPara

ProPara aims to promote research in natural language understanding in the context of procedural text. This requires identifying the actions described in the paragraph and tracking the state changes that happen to the entities involved.

Example Question

Given this five-sentence procedural paragraph (id 1167 from the training partition):

① The gravity of the sun pulls its mass inward. ② There is a lot of pressure on the Sun. ③ The pressure forces atoms of hydrogen to fuse together in nuclear reactions. ④ The energy from the reactions gives off different kinds of light. ⑤ The light travels to the Earth.

Consider the two participant entities:

  • atoms of hydrogen

  • sunlight or light

Predict answers to these four questions:

  1. What are the Inputs?

    • That is, which participants existed before the procedure began, and don't exist after the procedure ended? Or, what participants were consumed?

    • Answer: The inputs are atoms of hydrogen.

  2. What are the Outputs?

    • That is, which participants existed after the procedure ended, but didn't exist before the procedure began? Or, what participants were produced?

    • Answer: The outputs are light (or sunlight).

  3. What are the Conversions?

    • That is, which participants were converted to which other participants?

    • Answer: The participant atoms of hydrogen is converted into light (or sunlight) in sentence 3.

  4. What are the Moves?

    • That is, which participants moved from one location to another?

    • Answer: The participant light (or sunlight) moves from sun to earth in sentence 5.
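The four questions above can be read as queries over a per-sentence table of where each participant is. The following is only a rough sketch of that idea, not the official ProPara evaluator. The `STATES` grid is a hypothetical annotation written to agree with the answers above (the location names and the choice to place the creation of light at sentence 3 are assumptions, not values copied from the dataset), and the function names are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical state grid for paragraph 1167 (an assumption, not dataset output).
# Index 0 is the state before sentence 1; index t is the state after sentence t.
# None means the participant does not exist at that point.
STATES: Dict[str, List[Optional[str]]] = {
    "atoms of hydrogen": ["sun", "sun", "sun", None, None, None],
    "light": [None, None, None, "sun", "sun", "earth"],
}


def inputs(states: Dict[str, List[Optional[str]]]) -> List[str]:
    """Participants that exist before the procedure and not after it."""
    return [e for e, locs in states.items()
            if locs[0] is not None and locs[-1] is None]


def outputs(states: Dict[str, List[Optional[str]]]) -> List[str]:
    """Participants that exist after the procedure and not before it."""
    return [e for e, locs in states.items()
            if locs[0] is None and locs[-1] is not None]


def conversions(states: Dict[str, List[Optional[str]]]) -> List[Tuple[str, str, int]]:
    """(destroyed, created, sentence) triples where both events share a sentence."""
    found = []
    n_points = len(next(iter(states.values())))
    for t in range(1, n_points):
        destroyed = [e for e, locs in states.items()
                     if locs[t - 1] is not None and locs[t] is None]
        created = [e for e, locs in states.items()
                   if locs[t - 1] is None and locs[t] is not None]
        found.extend((d, c, t) for d in destroyed for c in created)
    return found


def moves(states: Dict[str, List[Optional[str]]]) -> List[Tuple[str, str, str, int]]:
    """(participant, from, to, sentence) where an existing participant changes location."""
    found = []
    for e, locs in states.items():
        for t in range(1, len(locs)):
            before, after = locs[t - 1], locs[t]
            if before is not None and after is not None and before != after:
                found.append((e, before, after, t))
    return found


print(inputs(STATES))
print(outputs(STATES))
print(conversions(STATES))
print(moves(STATES))
```

Under this reading, a system is judged on how well its predicted table yields the same inputs, outputs, conversions, and moves as the reference table.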

Market Resolution Criteria

https://leaderboard.allenai.org/propara/submissions/public

Top score on F1 is 0.731.

We would need to see an F1 of 1.3*0.731 (about 0.950) or greater by the end of the year for this to resolve YES, otherwise NO.
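As a quick check of the arithmetic, here is a small sketch of the resolution rule. The constant and function names are made up for illustration, and the human F1 of 0.839 is the leaderboard figure quoted in the comments on this market, not something stated in the criteria themselves.

```python
# Figures taken from the market description and comments.
BEST_AI_F1 = 0.731         # top leaderboard F1 when the market was created
REQUIRED_MULTIPLIER = 1.3  # "significantly better" as defined by this market
HUMAN_F1 = 0.839           # human score quoted in the comments (assumed accurate)

# Threshold the top leaderboard score must reach for YES.
threshold = REQUIRED_MULTIPLIER * BEST_AI_F1


def resolves_yes(top_f1: float) -> bool:
    """True if a leaderboard F1 meets or exceeds the market threshold."""
    return top_f1 >= threshold


print(f"threshold = {threshold:.4f}")
print(f"human score would resolve YES: {resolves_yes(HUMAN_F1)}")
```

The threshold works out to roughly 0.9503, so even a system that matched the quoted human score would not be enough for a YES resolution.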

Please see markets embedded for how I will format this description in the future to try to make it more clear and provide more information.

predictedNO

Created another market to try to address some concerns on this market:

As others have pointed out:

  1. Best current AI score is 0.731

  2. Best current human score is 0.839

  3. Threshold score for YES is 0.950

Seems wildly unrealistic that AI performance would leapfrog human performance that dramatically by the end of the year. Betting heavily NO.

@jonsimon meant average human score, not best human score

predictedNO

@jonsimon It seems that I picked too high of a threshold for this market when I originally created it, and now it's too late to change it with so many pre-existing positions.

predictedNO

@PatrickDelaney It happens 🤷‍♂️

Creating a new market was the right call

The threshold for success here is well above reported human performance, and the test labels aren’t perfectly objective. I’ll bet that this is impossible to achieve without memorizing the test set, no matter how good the system is at reasoning.

predictedNO

@Hedgehog are you talking about just the one sample I gave above or a larger view of the test labels? Honestly the only one I looked at was the example above. Another question: OK, while the test may not be a perfect arbiter of "tracking changes in state" (which you aren't really arguing for or against anyway) ... what about the possibility of overtraining for this specific data set? Do you have any objections if someone just overtrains to the dataset and gets a higher score, given that the market threshold has to do with a benchmark, while the market description has to do with something arguably more philosophical in nature?

predictedNO

@PatrickDelaney What do you mean by overtrain? The labels themselves are often wrong or deeply ambiguous, such that nothing short of literally memorizing the test set should get you performance this high.

predictedNO

@Hedgehog I’m not sure there’s really a single coherent skill that this is isolating, but to the extent there is one I do expect it to be something we make fast progress on.

predictedNO

@Hedgehog Do you have any citation or reference for what you are claiming here? I am willing to compensate you in the comments with a Manalink to help clarify the market. I am open to resolving N/A to reflect a better ground truth... the map is not the territory. That being said, as a matter of procedure I can't take assurances in the form of comments at face value; I need supporting information. Does that make sense?

predictedNO

@Hedgehog Could you give some examples of the labels being wrong or perhaps an article or blog post or paper which points this out?

predictedNO

@PatrickDelaney I’m going off the ‘human’ number on the leaderboard. From the dataset paper, it looks like it was collected and tested on MTurk, which generally gets you at least that much noise, and which means that when the human evaluators disagree with the reference labels, they’re coming from the same place, and about equally likely to be right.

ChatGPT's response to some of the above:

Inputs:

Within the preceding list of 5 items, there are no inputs that were consumed or no participants that existed before the procedure began and don't exist after the procedure ended. The mass of the Sun and the hydrogen atoms present in the Sun are the inputs that exist before the procedure began and continue to exist after the procedure has ended. They are not consumed in the process.

  • Wrong.

Outputs

Within the preceding list of 5 items, the outputs that were produced as a result of the procedure are:

  • Different kinds of light energy emitted by the Sun

  • Gravitational force acting on the Sun

Both of these outputs exist after the procedure has ended but didn't exist before the procedure began. The light energy is produced as a result of the nuclear fusion reactions and the gravitational force is a result of the mass of the Sun being pulled inward by its own gravity.

  • Got the light output correct, but wrong because it added in gravitational force.

Other example markets:
