🐕 Will A.I. Be Significantly Better, "Able to Track Changes in State," By the End of 2023?
closes Jan 1

Warning! This market may already be doomed, as the threshold picked is higher than human performance. However, I don't know for sure; I'm just letting you know before you bet. Here's one that's set at slightly below human performance:

Preface / Inspiration:

  • There are a lot of questions on Manifold about whether we'll see sentience, general A.I., and other nonsense or faith-based questions, which rely on the market maker's interpretation and often close at some far-distant point in the future when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions that resolve by reference to a third party.

Market Description


ProPara aims to promote the research in natural language understanding in the context of procedural text. This requires identifying the actions described in the paragraph and tracking state changes happening to the entities involved.

Example Question

Given this five-sentence procedural paragraph (id 1167 from the training partition):

① The gravity of the sun pulls its mass inward. ② There is a lot of pressure on the Sun. ③ The pressure forces atoms of hydrogen to fuse together in nuclear reactions. ④ The energy from the reactions gives off different kinds of light. ⑤ The light travels to the Earth.

Consider the two participant entities:

  • atoms of hydrogen

  • sunlight or light

Predict answers to these four questions:

  1. What are the Inputs?

    • That is, which participants existed before the procedure began, and don't exist after the procedure ended? Or, what participants were consumed?

    • Answer: The inputs are atoms of hydrogen.

  2. What are the Outputs?

    • That is, which participants existed after the procedure ended, but didn't exist before the procedure began? Or, what participants were produced?

    • Answer: The outputs are light (or sunlight).

  3. What are the Conversions?

    • That is, which participants were converted to which other participants?

    • Answer: The participant atoms of hydrogen is converted into light (or sunlight) in sentence 3.

  4. What are the Moves?

    • That is, which participants moved from one location to another?

    • Answer: The participant light (or sunlight) moves from sun to earth in sentence 5.
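The four questions above can be read as queries over a list of state-change events. Here is a minimal sketch in Python, not the benchmark's actual evaluator: the `Event` type, the hand-written `EVENTS` list, and the four helper functions are all hypothetical names made up for illustration, and the events are annotated by hand so that they match the answers given above (including the assumption that a conversion is a destroy and a create in the same sentence).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One state change to one participant (hypothetical schema)."""
    step: int                       # 1-based sentence number
    kind: str                       # "create", "destroy", or "move"
    entity: str
    from_loc: Optional[str] = None  # only used for "move"
    to_loc: Optional[str] = None    # only used for "move"


# Hand annotation of paragraph 1167, chosen to match the answers above.
EVENTS = [
    Event(3, "destroy", "atoms of hydrogen"),
    Event(3, "create", "light"),
    Event(5, "move", "light", from_loc="sun", to_loc="earth"),
]


def inputs(events: List[Event]) -> List[str]:
    # Consumed participants: destroyed at some step and never created.
    created = {e.entity for e in events if e.kind == "create"}
    destroyed = {e.entity for e in events if e.kind == "destroy"}
    return sorted(destroyed - created)


def outputs(events: List[Event]) -> List[str]:
    # Produced participants: created at some step and never destroyed.
    created = {e.entity for e in events if e.kind == "create"}
    destroyed = {e.entity for e in events if e.kind == "destroy"}
    return sorted(created - destroyed)


def conversions(events: List[Event]) -> List[Tuple[str, str, int]]:
    # Assumption: a conversion is a destroy and a create in the same step.
    return [
        (d.entity, c.entity, d.step)
        for d in events if d.kind == "destroy"
        for c in events if c.kind == "create" and c.step == d.step
    ]


def moves(events: List[Event]) -> List[Tuple[str, Optional[str], Optional[str], int]]:
    # Every "move" event, as (entity, from, to, sentence number).
    return [(e.entity, e.from_loc, e.to_loc, e.step)
            for e in events if e.kind == "move"]


print("Inputs:", inputs(EVENTS))
print("Outputs:", outputs(EVENTS))
print("Conversions:", conversions(EVENTS))
print("Moves:", moves(EVENTS))
```

The hard part of the benchmark is producing the event list from the raw text; once the events exist, answering the four questions is simple bookkeeping like this.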

Market Resolution Criteria


Top score on F1 is 0.731.

We would need to see an F1 score of 1.3*0.731 (about 0.950) or greater by the end of the year for this to resolve YES, otherwise NO.
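To make the threshold arithmetic concrete, here is a small Python sketch. The variable names are made up for illustration; the 0.731 and 1.3 figures come from the criteria above, and the 0.839 human score is the figure quoted in the comments on this market (described there as the average human score).

```python
# Figures taken from this market's description and comments.
best_ai_f1 = 0.731   # top leaderboard F1 at market creation
multiplier = 1.3     # "significantly better" factor used by this market
human_f1 = 0.839     # average human score quoted in the comments

threshold = multiplier * best_ai_f1
print(f"YES threshold: {threshold:.3f}")  # prints "YES threshold: 0.950"

# How the threshold compares with the quoted human score.
print("Threshold exceeds human score:", threshold > human_f1)
print(f"Threshold / human score: {threshold / human_f1:.2f}")
```

In other words, a YES resolution needs a system that scores roughly 13% above the quoted human number, not just 30% above the best model.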

Please see markets embedded for how I will format this description in the future to try to make it more clear and provide more information.

Patrick Delaney predicts NO

Created another market to try to address some concerns on this market:

Jon Simon

As others have pointed out:

  1. Best current AI score is 0.731

  2. Best current human score is 0.839

  3. Threshold score for YES is 0.950

Seems wildly unrealistic that AI performance would leapfrog human performance that dramatically by the end of the year. Betting heavily NO.

Jon Simon bought Ṁ0 of NO

@jonsimon I meant the average human score, not the best human score.

Patrick Delaney predicts NO

@jonsimon It seems that I picked too high a threshold for this market when I originally created it, and now it's too late to change it with so many pre-existing positions.

Jon Simon predicts NO

@PatrickDelaney It happens 🤷‍♂️

Creating a new market was the right call

🦔 bought Ṁ1,000 of NO

The threshold for success here is well above reported human performance, and the test labels aren’t perfectly objective. I’ll bet that this is impossible to achieve without memorizing the test set, no matter how good the system is at reasoning.

Patrick Delaney predicts NO

@Hedgehog Are you talking about just the one sample I gave above, or a larger view of the test labels? Honestly, the only one I looked at was the example above. Another question: granted, the test may not be a perfect arbiter of "tracking changes in state" (which you aren't really arguing for or against anyway), but what about the possibility of overtraining on this specific dataset? Do you have any objections if someone just overtrains on the dataset and gets a higher score, given that the market threshold is tied to a benchmark, while the market description concerns something arguably more philosophical in nature?

🦔 predicts NO

@PatrickDelaney What do you mean by overtrain? The labels themselves are often wrong or deeply ambiguous, such that nothing short of literally memorizing the test set should get you performance this high.

🦔 predicts NO

@Hedgehog I’m not sure there’s really a single coherent skill that this is isolating, but to the extent there is one I do expect it to be something we make fast progress on.

Patrick Delaney predicts NO

@Hedgehog Do you have any citation or reference for what you are claiming here? I am willing to compensate you in the comments with a Manalink to help clarify the market. I am open to resolving N/A to reflect a better ground truth; the map is not the territory. That being said, as a matter of procedure I can't take assurances in the form of comments at face value. I need supporting information. Does that make sense?

Patrick Delaney predicts NO

@Hedgehog Could you give some examples of the labels being wrong or perhaps an article or blog post or paper which points this out?

🦔 predicts NO

@PatrickDelaney I’m going off the ‘human’ number on the leaderboard. From the dataset paper, it looks like it was collected and tested on MTurk, which generally gets you at least that much noise, and which means that when the human evaluators disagree with the reference labels, they’re coming from the same place, and about equally likely to be right.

Patrick Delaney

ChatGPT's response to some of the above:


Within the preceding list of 5 items, there are no inputs that were consumed or no participants that existed before the procedure began and don't exist after the procedure ended. The mass of the Sun and the hydrogen atoms present in the Sun are the inputs that exist before the procedure began and continue to exist after the procedure has ended. They are not consumed in the process.

  • Wrong.


Within the preceding list of 5 items, the outputs that were produced as a result of the procedure are:

  • Different kinds of light energy emitted by the Sun

  • Gravitational force acting on the Sun

Both of these outputs exist after the procedure has ended but didn't exist before the procedure began. The light energy is produced as a result of the nuclear fusion reactions and the gravitational force is a result of the mass of the Sun being pulled inward by its own gravity.

  • Got the light output correct, but wrong because it added in gravitational force.

Patrick Delaney

Other example markets: