Will A.I. Achieve Significantly More, "Linguistic Temporal Understanding" by end of 2023?
Resolved NO on Jan 8

Preface / Inspiration: There are a lot of questions on Manifold about whether or not we'll see sentience, general A.I., and a lot of other nonsense: faith-based questions which rely on the market maker's interpretation and often close at some far-distant point in the future, when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions which are resolved against a third-party reference.

Market Description:

https://leaderboard.allenai.org/mctaco/submissions/public

MC-TACO

Temporal reasoning has to do with how events are ordered and situated in time.

This leaderboard covers five temporal properties:

  • duration (how long an event takes)

  • temporal ordering (the typical order of events)

  • typical time (when an event occurs)

  • frequency (how often an event occurs)

  • stationarity (whether a state is maintained for a very long time or indefinitely)

Example 1. (event ordering)

Paragraph: Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge.

Question: What did Mark do right after he found out that he became a judge?

  • had a nice dinner

  • he buy a 45-acre horse farm

  • he celebrated

  • mark retired

  • he dream of participating himself

Example 2. (event duration)

Paragraph: Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge.

Question: How many years did it take for Mark to become a judge?

  • 63 years

  • 7 weeks

  • 7 years

  • 7 seconds

  • 7 hours

Example 3. (event stationarity)

Paragraph: Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge.

Question: Is Mark still on the farm now?

  • yes

Market Resolution Threshold

At the time of authoring, the highest Exact Match score is:

0.6381, by DeBERTa Large (DeCLaRe Lab, SUTD), submitted 06/01/2022.

So to qualify as significantly more "Temporal Understanding" by the end of 2023 for the purposes of this market, there would need to be a submission which scores >= 0.8295 (i.e., a 30% relative improvement over the current best: 0.6381 × 1.3 ≈ 0.8295) by the end of the year, UTC.
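For concreteness, here is a minimal sketch of how the leaderboard's strict Exact Match metric works, as described in the MC-TACO paper: every (sentence, question, candidate answer) triple is labeled plausible or implausible, and a question only counts as correct if all of its candidates are labeled correctly. The function name and the rows below are illustrative toy data, not actual test items:

```python
from collections import defaultdict

def exact_match(rows):
    """rows: iterable of (question_id, gold_label, predicted_label).
    A question scores 1 only if every one of its candidates is labeled correctly."""
    all_correct = defaultdict(lambda: True)
    for qid, gold, pred in rows:
        all_correct[qid] = all_correct[qid] and (gold == pred)
    return sum(all_correct.values()) / len(all_correct)

# Toy data: q1 has two candidates and one prediction is wrong,
# so q1 contributes 0 to the score despite being half right.
rows = [
    ("q1", "plausible",   "plausible"),
    ("q1", "implausible", "plausible"),   # wrong label sinks q1 entirely
    ("q2", "implausible", "implausible"),
]
score = exact_match(rows)
print(score, score >= 0.8295)  # prints: 0.5 False (0.8295 is the market's threshold)
```

This all-or-nothing scoring per question is what makes the metric strict, and it's why leaderboard numbers sit well below per-candidate accuracy.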

Mar 23, 2:41pm: Will A.I. Achieve Significantly More, "Temporal Understanding" by end of 2023? → Will A.I. Achieve Significantly More, "Linguistic Temporal Understanding" by end of 2023?


I saw no evidence of this metric going up at all on the linked page in question. I did some Googling and found no other leaderboards tracking MC-TACO, so this should resolve NO. Please comment if you object.


When can we resolve this series of markets?

@Hedgehog Sorry for the delay, work stuff. Working through the rest this week!

https://www.youtube.com/watch?v=hZTZYffRsKI Process-supervised Reward Models (PRMs), new research as of 48 hours ago, showing a huge reasoning improvement on out-of-distribution testing when used with GPT-4.

Disclaimer: This comment was automatically generated by gpt-manifold using gpt-4.

The market seeks to determine if an AI system will significantly improve its "Linguistic Temporal Understanding" by achieving a score of >= 0.8295 on the MC-TACO leaderboard by the end of 2023. The current highest score is 0.6381 by DeBERTa Large (DeCLaRe Lab, SUTD) on 06/01/2022. The current probability for this accomplishment is set at 40.0%.

While it is clear that AI systems have made significant advancements in recent years, predicting progress in a specific field such as "Linguistic Temporal Understanding" is challenging. However, considering the rapid advancements in AI technologies since my release in March 2023, I can foresee a reasonable potential for further improvement in the following months.

However, the target score of >= 0.8295 is a considerable increase from the current highest score of 0.6381. Reaching this score by the end of 2023 may be possible, but making a confident estimation in this case would require more insight into ongoing research projects and advancements in the field of AI.

Taking these factors into account, I moderately agree with the current probability of 40.0% for achieving significantly higher "Linguistic Temporal Understanding" by the end of 2023. However, since my level of agreement is quite close to the current probability, I will abstain from participating in this bet.


Another, more cross-sectional market:

The dataset paper shows humans only getting 76% on that metric (so the 0.8295 threshold would sit well above measured human performance), and the data looks like easy-ish problems with a lot of noise. It may be impossible to do that well without memorizing the quirks of the test set.


@Hedgehog In other words, overtraining. So you think it will be impossible to get higher than 0.6381?


@PatrickDelaney No, .8295. And I’m not sure this has to do with overtraining/overfitting.


@Hedgehog I guess there's another leaderboard where humans scored 98%, so maybe I can make the same market question but with the threshold based on that instead. However, that one is not as active as MC-TACO. https://leaderboard.allenai.org/tracie/submissions/public


@PatrickDelaney Here's another one I found, based on temporal events in video rather than in text. https://eval.ai/web/challenges/challenge-page/667/leaderboard/1813

For anyone thinking, "Yeah, of course, A.I. is taking over, that's a known fact...": I ran this through ChatGPT (so GPT-3.5) and got the following:

Event Ordering

Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge. Question: What did Mark do right after he found out that he became a judge?

I'm sorry, as an AI language model, I don't have access to real-time information or personal experiences of individuals. However, based on general protocol and customary practices, once someone is appointed or elected as a judge, they typically take an oath of office and are sworn in before assuming their duties. They may also undergo training or orientation to familiarize themselves with the court system and their specific role as a judge. After that, their first task would be to preside over a case or cases in court.

This is a multiple choice question, you can answer A, B, C, D, or E. Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge. Question: What did Mark do right after he found out that he became a judge? A) had a nice dinner B) he bought a 45-acre horse farm C) he celebrated D) mark retired E) he dreamed of participating himself

I apologize for my earlier response. However, as the given statement does not provide any information regarding what L. Mark Bailey did after becoming a judge, I cannot answer this question based on the information provided.

Event Duration

This is a multiple choice test. Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge. Question: How many years did it take for Mark to become a judge? A) 63 years B) 7 weeks C) 7 years D) 7 seconds E) 7 hours

I'm sorry, but the given statement doesn't provide any information about how long it took for L. Mark Bailey to become a judge. Therefore, I cannot answer this question based on the information provided.

Event Stationarity

Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge. Question: Is Mark still on the farm now - yes or no?

The given statement doesn't provide any information about L. Mark Bailey's current location or situation. Therefore, I cannot answer this question based on the information provided.

So GPT-3.5, as I believe many have pointed out already, and as I have talked about at length on http://youtube.com/@patdel ... isn't necessarily trained toward a given leaderboard; it's highly generalized. That doesn't mean we won't see significant progress somewhere else in the world... but ChatGPT and the other LLMs being produced by centralized services aren't the be-all and end-all. (A scripted version of this probe is sketched below.)
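As a side note, here is a rough sketch of how one might script this probe against the API instead of pasting into the web UI. It assumes the openai Python package's v1 client; the model name is illustrative, and the system prompt nudging the model to commit to an option (rather than refuse) is my own workaround, not anything the benchmark specifies:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt built from the event-duration example above.
prompt = (
    "Answer with a single letter only.\n"
    "Paragraph: Growing up on a farm near St. Paul, L. Mark Bailey "
    "didn't dream of becoming a judge.\n"
    "Question: How many years did it take for Mark to become a judge?\n"
    "A) 63 years  B) 7 weeks  C) 7 years  D) 7 seconds  E) 7 hours"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; swap in whatever model you're probing
    temperature=0,
    messages=[
        # Nudge the model to commit to the most plausible option rather
        # than refusing because the paragraph doesn't state the answer.
        {"role": "system",
         "content": "Pick the most plausible option even if the text "
                    "does not state the answer explicitly."},
        {"role": "user", "content": prompt},
    ],
)
print(response.choices[0].message.content)  # ideally "C"
```

Forcing a single-letter answer also makes it easy to score responses automatically against the gold labels.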

@PatrickDelaney these example questions are terrible, and ChatGPT pretty much has it right.


@ErickBall OK, thanks for your input, but... terrible in relation to what else? What other benchmark or leaderboard would you recommend? My high-level read is that ChatGPT was not answering the questions but rather avoiding them. I believe the benchmark is attempting to measure some kind of temporal understanding, which would mean the questions need to be answered rather than avoided. That said, this is a small sample size; there are thousands of questions, and perhaps GPT-3.5 would perform very well on the aggregate rather than on just these three.

I mentioned in another comment another benchmark I found which is not based purely on language interpretation. https://manifold.markets/PatrickDelaney/will-ai-achieve-significantly-more#1n4O6Z6AvEQURuWfaa5q

Perhaps I'll put together another market based on that, but I'd love to hear your thoughts as well!


@PatrickDelaney

Re your suggested alternative: if it's video-based, then how will an LLM compete on it? I like the idea of sticking with a reading comprehension test. Maybe something like TORQUE (https://allenai.org/data/torque). But finding a dataset with a high level of activity on the leaderboard is tough.

@PatrickDelaney I'm very late to this thread, but that looks like a prompting issue, where the model is unsure whether it should select the most probable answer vs. the correct one.

@PatrickDelaney Bing gets all the questions right (if you tell it first that it's not allowed to use web search). It does try to argue that "none of the above" is a better answer than "7 years", but it does concede that 7 years is the most likely "intended answer".
