๐Ÿ• Will A.I. Get Significantly Better at Evaluating Scientific Claims by the end of 2024? (As Measured By Leaderboard)
88
Ṁ9595
Jan 1
3%
chance

Preface / Inspiration:

  • There are a lot of questions on Manifold about whether or not we'll see sentience, general A.I., and a lot of other nonsense and faith-based questions which rely on the market maker's interpretation and often close at some far distant point in the future when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions which are referenced by a third party.

Market Description

SciFact

SciFact is a public leaderboard challenge that measures how well AI systems can judge whether a scientific claim is supported by evidence, using (claim, evidence, decision) tuples. Inspiration for SciFact, from AllenAI:

Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims.

This challenge employs a public dataset of Claims, Evidence and Decisions which anyone can participate in evaluating. https://leaderboard.allenai.org/scifact/submissions/get-started

Here are a couple of Claim vs. Evidence examples from the dataset:

  • Claim: Prescribed exercise training improves quality of life.

  • Evidence: At 3 months, usual care plus exercise training led to greater improvement in the KCCQ overall summary score (mean, 5.21; 95% confidence interval, 4.42 to 6.00) compared with usual care alone (3.28; 95% confidence interval, 2.48 to 4.09).

  • Decision: SUPPORT

  • Claim: Patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anaemia.

  • Evidence: The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA.

  • Decision: REFUTE
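As a rough sketch, the two examples above can be held as (claim, evidence, decision) tuples. The class and field names below are illustrative assumptions, not the actual SciFact file schema, and the evidence strings are shortened here.

```python
from typing import NamedTuple


class Example(NamedTuple):
    # Field names are hypothetical, chosen to mirror the bullets above.
    claim: str
    evidence: str
    decision: str  # "SUPPORT" or "REFUTE"


examples = [
    Example(
        claim="Prescribed exercise training improves quality of life.",
        evidence=(
            "At 3 months, usual care plus exercise training led to greater "
            "improvement in the KCCQ overall summary score compared with "
            "usual care alone."
        ),
        decision="SUPPORT",
    ),
    Example(
        claim=(
            "Patients with microcytosis and higher erythrocyte count are "
            "more vulnerable to severe malarial anaemia."
        ),
        evidence=(
            "The increased erythrocyte count and microcytosis in children "
            "homozygous for alpha(+)-thalassaemia may contribute "
            "substantially to their protection against SMA."
        ),
        decision="REFUTE",
    ),
]

# Print each decision next to its claim.
for ex in examples:
    print(f"{ex.decision}: {ex.claim}")
```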

Market Resolution Criteria

https://leaderboard.allenai.org/scifact/submissions/public

  • Using my standard metric that I have employed in a few other markets, will any entry surpass the top entry (for Sent+X F1 Score) by a factor of 1.3 by the end of the time period?

  • At the time of authoring, the top score is:

MultiVerS, Allen Institute for AI and Un…, 06/04/2021, 0.6721

Therefore, will any entry on this leaderboard be equal to or greater than 0.8737 by the end of 2024? If so, market resolves YES, otherwise NO.
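The threshold arithmetic can be checked with a short sketch. The function name `resolves_yes` is hypothetical, and rounding the product to four decimals is an assumption made to match the 0.8737 figure quoted above.

```python
TOP_SCORE = 0.6721  # top leaderboard score at the time of authoring
FACTOR = 1.3        # improvement factor used by this market

# 0.6721 * 1.3 = 0.87373, rounded to four decimals as in the description.
threshold = round(TOP_SCORE * FACTOR, 4)


def resolves_yes(best_score: float) -> bool:
    """Return True if a leaderboard score meets or exceeds the threshold."""
    return best_score >= threshold


print(f"{threshold:.4f}")
print(resolves_yes(0.6721))  # the current top entry does not qualify
```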


No one has bothered to apply a frontier language model to this dataset, which appears to be largely defunct judging by the fact the top entry is from 2021. The leaderboard is thus a poor measure of the question stated in the title.

sold Ṁ24 YES

@TimothyScriven Good point.

@PhilosophyBear Updated the title

What's the metric here? The leaderboard doesn't seem to show the one you're referencing.

@Hedgehog Guessing Sent + L instead of Sent + X, since that matches the number above.

According to this article we are far from it:

https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/

I empathize with poor Yann LeCun, who has to defend this mess from his company while being one of the few who acknowledge that LLM limitations might not be salvageable.

predictedYES

@Zardoru From most of what I have heard from Yann LeCun on podcasts, he does not seem to have much faith in GPTs to bring in the next wave of transformative A.I., so I read his quote there as possibly being taken out of context. If not, that sure sounds like a strong overreaction to cover his own butt...? Keep in mind that this article came out prior to ChatGPT and the other waves of LLMs we have had since then, so the performance of subsequent LLMs could be far better than even the GPT-based Galactica.

As with the other Allen Institute markets, I would be shocked if current LLMs could not already crush SOTA (actual current LLMs might have included the dataset in their training, but if you removed it I would not expect performance to shift much). I also suspect that no one will bother to actually run them. Given how bad SOTA is on most of the Allen Institute benchmarks, I think it's pretty clear that no one cares about them.

predictedYES

@vluzko on the other hand, go ahead and run some tests on your chosen OpenAI API rather than just talking?

predictedYES

@vluzko also I don't see any bet?

predictedYES

@vluzko you can add better benchmarking institution suggestions here.

@Messi thanks for your earlier comments and contributions about having originally spoken Spanish but now being able to communicate more with ChatGPT. That's fascinating. I speak both Spanish and English fluently. My Spanish reading is more like a fourth grader, I can read magazines and such pretty well but big complicated books are hard for me.

predictedNO

{reposting the comment explaining why I bought no}

yeah so I bet 50 mana which brought the market down to 10% cuz it was new (i didn't know that you had just created it haha), so i sold and brought down to 35%.

As for why I bought NO, science derives from underlying principles of Nature, and we accumulate evidence related to that principle. If the evidence disagrees with the theory we have proposed, then the theory is wrong. If the evidence agrees with the theory, then it might be right. This is a one-way relationship b/w evidence and hypotheses/theory.

In the worldview of LLMs (which is correct for LLMs themselves, but may not be directly coherent for our perception of worldview), the theory and evidence (i think) do not possess such a relationship. Their worldview is based (current systems only) on the correlation and similarities between concepts and is abstracted away in some way.

So, a 1.3X improvement would be quite significant imo. There are a lot of things which are not axiomatic and can be easily learned by LLMs to classify as correct or not correct based on their relationships, but the distinction between (evidence overrules theory) and (evidence and theory exist in the same space closely and are bidirectionally related) is not so clear as far as i know.

@firstuserhere OK thanks for that. For anyone else joining in, the below comment thread was attacked by a troll, which is why we restarted this.

predictedNO

@PatrickDelaney here's a video i just found which illustrates an example of my above intuition for failure modes (not directly but i hope the example stimulates thinking about why)

@firstuserhere This is probably a stupid way of doing this, but I'm putting a small amount on YES because I tested the above challenges (super tiny sample size) in ChatGPT, and it got both correct. So my thought is, someone could come in and use a more sophisticated LLM inference model, perhaps LLAMA or something, and crush this.

Of course that's a small sample size, and if we really measured, it might be nowhere near the ~0.87 threshold.

Claim: Prescribed exercise training improves quality of life. Evidence: At 3 months, usual care plus exercise training led to greater improvement in the KCCQ overall summary score (mean, 5.21; 95% confidence interval, 4.42 to 6.00) compared with usual care alone (3.28; 95% confidence interval, 2.48 to 4.09). Question for ChatGPT ... does the Evidence support the above Claim?

Yes, the evidence provided supports the claim that prescribed exercise training improves the quality of life. The study found that at 3 months, patients who received usual care plus exercise training showed greater improvement in the KCCQ overall summary score compared to those who received usual care alone. This suggests that exercise training has a positive impact on patients' quality of life. Additionally, the study provides a confidence interval for the improvement in the KCCQ score, which further supports the conclusion that the improvement is statistically significant.

Claim: Patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anaemia. Evidence: The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA. Question for ChatGPT ... does the above Evidence support the Claim?

No, the evidence provided does not support the claim that patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anemia. The evidence actually suggests the opposite, stating that the increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassemia may contribute substantially to their protection against severe malarial anemia (SMA). Therefore, the evidence contradicts the claim.
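For anyone who wants to repeat this informal check on more examples, here is a minimal sketch of the bookkeeping. It only builds the prompt and maps a free-text reply to a decision, and it makes no API call. The helper names and the "starts with Yes/No" heuristic are assumptions for illustration, and plain accuracy on a handful of examples is not the leaderboard's F1 metric.

```python
def build_prompt(claim: str, evidence: str) -> str:
    """Format a claim/evidence pair like the questions asked above."""
    return (
        f"Claim: {claim} Evidence: {evidence} "
        "Question: does the Evidence support the above Claim?"
    )


def reply_to_decision(reply: str) -> str:
    """Crude heuristic: read the decision off the first word of the reply."""
    text = reply.strip().lower()
    if text.startswith("yes"):
        return "SUPPORT"
    if text.startswith("no"):
        return "REFUTE"
    return "UNKNOWN"


# (first sentence of each ChatGPT reply quoted above, expected decision)
trials = [
    ("Yes, the evidence provided supports the claim that prescribed "
     "exercise training improves the quality of life.", "SUPPORT"),
    ("No, the evidence provided does not support the claim that patients "
     "with microcytosis and higher erythrocyte count are more vulnerable "
     "to severe malarial anemia.", "REFUTE"),
]

correct = sum(reply_to_decision(reply) == gold for reply, gold in trials)
print(f"{correct}/{len(trials)} correct")
```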

@PatrickDelaney LLaMA is not better than ChatGPT. In fact, it's not even close to ChatGPT. Maybe you mean something like GPT-4?

predictedYES

@ShadowyZephyr Define "not even close." What benchmarks and what margins? GPT-3.5 and GPT-4 are effectively analogous to ChatGPT, but we could discuss any of them.
