Warning! This market may already be doomed, as the threshold I picked is higher than human performance. I don't know that for sure, though; I'm just letting you know before you bet. Here's one that's set at slightly below human performance:
Preface / Inspiration:
There are a lot of questions on Manifold about whether or not we'll see sentience, general A.I., and a lot of other nonsense and faith-based questions which rely on the market maker's interpretation and often close at some far-distant point in the future when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions that resolve against a third-party reference.
Market Description
ProPara
ProPara aims to promote research in natural language understanding in the context of procedural text. This requires identifying the actions described in the paragraph and tracking the state changes happening to the entities involved.
Example Question
Given this five-sentence procedural paragraph (id 1167 from the training partition):
① The gravity of the sun pulls its mass inward. ② There is a lot of pressure on the Sun. ③ The pressure forces atoms of hydrogen to fuse together in nuclear reactions. ④ The energy from the reactions gives off different kinds of light. ⑤ The light travels to the Earth.
Consider the two participant entities:
atoms of hydrogen
sunlight or light
Predict answers to these four questions:
What are the Inputs?
That is, which participants existed before the procedure began, and don't exist after the procedure ended? Or, what participants were consumed?
Answer: The inputs are atoms of hydrogen.
What are the Outputs?
That is, which participants existed after the procedure ended, but didn't exist before the procedure began? Or, what participants were produced?
Answer: The outputs are light (or sunlight).
What are the Conversions?
That is, which participants were converted to which other participants?
Answer: The participant atoms of hydrogen is converted into light (or sunlight) in sentence 3.
What are the Moves?
That is, which participants moved from one location to another?
Answer: The participant light (or sunlight) moves from sun to earth in sentence 5.
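To make the four question types concrete, here is a minimal Python sketch that derives the answers from a per-sentence state table. The table layout, the function names, and the specific location values are my own illustrative assumptions, chosen to agree with the answers given above; they are not copied from the ProPara annotations or from any official evaluator.

```python
# Hypothetical state table (an assumption, not the real ProPara annotation).
# Index 0 is the state before sentence 1; index k is the state after sentence k.
# None means "does not exist at this point".
states = {
    "atoms of hydrogen": ["sun", "sun", "sun", None, None, None],
    "light": [None, None, None, "sun", "sun", "earth"],
}


def inputs(states):
    """Participants that exist before the procedure and not after it."""
    return [p for p, locs in states.items()
            if locs[0] is not None and locs[-1] is None]


def outputs(states):
    """Participants that exist after the procedure and not before it."""
    return [p for p, locs in states.items()
            if locs[0] is None and locs[-1] is not None]


def conversions(states):
    """(destroyed, created, sentence) triples where both happen in one sentence."""
    n_steps = len(next(iter(states.values())))
    found = []
    for step in range(1, n_steps):
        destroyed = [p for p, locs in states.items()
                     if locs[step - 1] is not None and locs[step] is None]
        created = [p for p, locs in states.items()
                   if locs[step - 1] is None and locs[step] is not None]
        found.extend((d, c, step) for d in destroyed for c in created)
    return found


def moves(states):
    """(participant, from, to, sentence) for every change of location."""
    found = []
    for p, locs in states.items():
        for step in range(1, len(locs)):
            before, after = locs[step - 1], locs[step]
            if before is not None and after is not None and before != after:
                found.append((p, before, after, step))
    return found


print(inputs(states))
print(outputs(states))
print(conversions(states))
print(moves(states))
```

With this assumed table, the four functions reproduce the four answers above: hydrogen as the input, light as the output, a hydrogen-to-light conversion in sentence 3, and light moving from sun to earth in sentence 5.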
Market Resolution Criteria
https://leaderboard.allenai.org/propara/submissions/public
Top score on F1 is 0.731.
We would need to see an F1 of 1.3*0.731 (about 0.950) or greater on the leaderboard by the end of the year for this to resolve YES; otherwise it resolves NO.
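For anyone who wants to check the threshold arithmetic, here is a small Python sketch. The variable and function names are mine and purely illustrative; the only numbers taken from this description are the 0.731 top score and the 1.3 multiplier.

```python
# Numbers from the market description.
top_f1 = 0.731      # top F1 on the leaderboard when the market was created
multiplier = 1.3    # required improvement factor

threshold = multiplier * top_f1


def resolves_yes(best_f1: float) -> bool:
    """Hypothetical helper: True if a leaderboard F1 meets the threshold."""
    return best_f1 >= threshold


print(f"threshold = {threshold:.4f}")
print(resolves_yes(0.731))
print(resolves_yes(0.96))
```

The threshold comes out at roughly 0.9503, so the current top score is well short of it and a hypothetical score of 0.96 would clear it.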
Please see the embedded markets for how I will format this description in the future to make it clearer and provide more information.
New version of this market posted here: https://manifold.markets/PatrickDelaney/-will-ai-be-significantly-better-ab
@jonsimon It seems that I picked too high of a threshold for this market when I originally created it, and now it's too late to change it with so many pre-existing positions.
@Hedgehog Are you talking about just the one sample I gave above, or a larger view of the test labels? Honestly, the only one I looked at was the example above. Another question: OK, while the test may not be a perfect arbiter of "tracking changes in state" (which you aren't really arguing for or against anyway), what about the possibility of overtraining for this specific dataset? Do you have any objections if someone just overtrains to the dataset and gets a higher score, given that the market threshold has to do with a benchmark, while the market description has to do with something arguably more philosophical in nature?
@PatrickDelaney What do you mean by overtrain? The labels themselves are often wrong or deeply ambiguous, such that nothing short of literally memorizing the test set should get you performance this high.
@Hedgehog I'm not sure there's really a single coherent skill that this is isolating, but to the extent there is one I do expect it to be something we make fast progress on.
@Hedgehog Do you have any citation or reference for what you are claiming here? I am willing to compensate you in the comments with a Manalink to help clarify the market. I am open to resolving N/A to reflect a better ground truth... the map is not the territory. That being said, as a matter of procedure I can't take assurances in the form of comments at face value; I need supporting information. Does that make sense?
@Hedgehog Could you give some examples of the labels being wrong or perhaps an article or blog post or paper which points this out?
@PatrickDelaney I'm going off the "human" number on the leaderboard. From the dataset paper, it looks like it was collected and tested on MTurk, which generally gets you at least that much noise, and which means that when the human evaluators disagree with the reference labels, they're coming from the same place, and about equally likely to be right.
ChatGPT's response to some of the above:
Inputs:
Within the preceding list of 5 items, there are no inputs that were consumed or no participants that existed before the procedure began and don't exist after the procedure ended. The mass of the Sun and the hydrogen atoms present in the Sun are the inputs that exist before the procedure began and continue to exist after the procedure has ended. They are not consumed in the process.
Wrong.
Outputs:
Within the preceding list of 5 items, the outputs that were produced as a result of the procedure are:
Different kinds of light energy emitted by the Sun
Gravitational force acting on the Sun
Both of these outputs exist after the procedure has ended but didn't exist before the procedure began. The light energy is produced as a result of the nuclear fusion reactions and the gravitational force is a result of the mass of the Sun being pulled inward by its own gravity.
Got the light output correct, but wrong because it added in gravitational force.
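To put rough numbers on those two ChatGPT answers, here is a sketch of a plain set-based F1 in Python. This is an assumption on my part about how partial credit might be scored: the leaderboard's actual evaluator may match and aggregate answers differently, and the `f1` function below is a generic textbook definition, not the official scorer.

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold answers (generic definition)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Inputs question: ChatGPT said nothing was consumed; the answer is hydrogen.
inputs_score = f1([], ["atoms of hydrogen"])

# Outputs question: ChatGPT named light plus an extra, incorrect item.
outputs_score = f1(["light", "gravitational force"], ["light"])

print(inputs_score)
print(round(outputs_score, 3))
```

Under this assumed scoring, the inputs answer gets 0.0 and the outputs answer gets about 0.667 (precision 0.5, recall 1.0), so the extra "gravitational force" item costs precision even though the light output was found.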