Will A.I. Get Significantly Better at Evaluating Scientific Claims by the end of 2024?
Closes 2025 · 51% chance

Preface / Inspiration:

  • There are a lot of questions on Manifold about whether we'll see sentience, general A.I., and a lot of other nonsense and faith-based questions which rely on the market maker's interpretation and often close at some far-distant point in the future when a lot of us will be dead. This is an effort to create meaningful bets on important A.I. questions which are resolved by reference to a third party.

Market Description

SciFact

SciFact is a public leaderboard challenge that attempts to measure how well AI systems can evaluate scientific claims, judging whether each claim is supported or refuted by evidence, with the data organized as (claim, evidence, decision) tuples. The inspiration for SciFact, from AllenAI:

Due to the rapid growth in the scientific literature, there is a need for automated systems to assist researchers and the public in assessing the veracity of scientific claims.

This challenge employs a public dataset of Claims, Evidence and Decisions which anyone can participate in evaluating. https://leaderboard.allenai.org/scifact/submissions/get-started

Here are a couple of example Claim vs. Evidence pairs from the dataset:

  • Claim: Prescribed exercise training improves quality of life.

  • Evidence: At 3 months, usual care plus exercise training led to greater improvement in the KCCQ overall summary score (mean, 5.21; 95% confidence interval, 4.42 to 6.00) compared with usual care alone (3.28; 95% confidence interval, 2.48 to 4.09).

  • Decision: SUPPORT

  • Claim: Patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anaemia.

  • Evidence: The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA.

  • Decision: REFUTE
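
For anyone who wants to work with examples like these programmatically, here is a minimal sketch of how a claim/evidence/decision tuple could be represented; the field names are my own illustrative choices, not the actual SciFact schema:

```python
# Illustrative only: field names are hypothetical, not SciFact's actual schema.
example = {
    "claim": "Prescribed exercise training improves quality of life.",
    "evidence": "At 3 months, usual care plus exercise training led to greater "
                "improvement in the KCCQ overall summary score ...",
    "decision": "SUPPORT",  # the gold label: SUPPORT or REFUTE
}
```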

Market Resolution Criteria

https://leaderboard.allenai.org/scifact/submissions/public

  • Using my standard metric, which I have employed in a few other markets: will any entry surpass the current top entry (for Sent+X F1 score) by a factor of 1.3 by the end of the time period?

  • At the time of authoring, the top score is:

MultiVerS · Allen Institute for AI and Un… · 06/04/2021 · 0.6721

Therefore, will any entry on this leaderboard be equal to or greater than 0.8737 by the end of 2024? If so, market resolves YES, otherwise NO.
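
To make the arithmetic explicit, here is a quick sketch of the threshold computation in Python:

```python
# Threshold arithmetic for market resolution.
top_score = 0.6721        # MultiVerS Sent+X F1 score at time of authoring
improvement_factor = 1.3  # the "significantly better" bar used by this market
threshold = top_score * improvement_factor
print(round(threshold, 4))  # 0.8737
```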

Zardoru bought Ṁ40 of NO

According to this article we are far from it:

https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/

Why Meta’s latest large language model survived only three days online
Galactica was supposed to help scientists. Instead, it mindlessly spat out biased and incorrect nonsense.

I empathize with poor Yann LeCun, who has to defend this mess from his company, even though he is one of the few who acknowledge that LLM limitations might not be salvageable.

Vincent Luczkow

As with the other Allen Institute markets, I would be shocked if current LLMs could not already crush SOTA (actual current LLMs might have included the dataset in their training, but if you removed it I would not expect performance to shift much). I also suspect that no one will bother to actually run them; given how bad SOTA is on most of the Allen Institute benchmarks, I think it's pretty clear that no one cares about them.

Patrick Delaney is predicting YES at 51%

@vluzko on the other hand, why not go ahead and run some tests on your chosen OpenAI API rather than just talking?

Patrick Delaney is predicting YES at 51%

@vluzko also I don't see any bet?

Patrick Delaney is predicting YES at 51%

@vluzko you can add better benchmarking institution suggestions here.

Patrick Delaney

@Messi thanks for your earlier comments and contributions about having originally spoken Spanish but now being able to communicate more with ChatGPT. That's fascinating. I speak both Spanish and English fluently, though my Spanish reading is more like a fourth grader's; I can read magazines and such pretty well, but big complicated books are hard for me.

firstuserhere is predicting NO at 35%

{reposting the comment explaining why I bought no}

Yeah, so I bet 50 mana, which brought the market down to 10% because it was new (I didn't know that you had just created it, haha), so I sold, bringing it back to 35%.

As for why I bought NO: science derives from underlying principles of Nature, and we accumulate evidence related to those principles. If the evidence disagrees with the theory we have proposed, then the theory is wrong. If the evidence agrees with the theory, then it might be right. This is a one-way relationship between evidence and hypotheses/theory.

In the worldview of LLMs (which is correct for LLMs themselves, but may not map directly onto our own worldview), theory and evidence (I think) do not possess such a relationship. Their worldview (in current systems, at least) is based on correlations and similarities between concepts, abstracted away in some fashion.

So, a 1.3x improvement would be quite significant, in my opinion. There are a lot of things which are not axiomatic and can easily be learned by LLMs and classified as correct or incorrect based on their relationships, but the distinction between "evidence overrules theory" and "evidence and theory exist closely in the same space and are bidirectionally related" is not so clear, as far as I know.

Patrick Delaney

@firstuserhere OK, thanks for that. For anyone else joining in: the comment thread below was attacked by a troll, which is why we restarted this one.

firstuserhere is predicting NO at 35%

@PatrickDelaney here's a video I just found which illustrates an example of my above intuition about failure modes (not directly, but I hope the example stimulates thinking about why).

Patrick Delaney bought Ṁ10 of YES

@firstuserhere This is probably a stupid way of doing this, but I'm putting a small amount on YES because I tested the above challenges (super tiny sample size) in ChatGPT, and it got both correct. So my thought is, someone could come in and use a more sophisticated LLM inference model, perhaps LLaMA or something, and crush this.

Of course, that's a small sample size, and if we really measured, it might be nowhere near the ~0.87 threshold.

Claim: Prescribed exercise training improves quality of life. Evidence: At 3 months, usual care plus exercise training led to greater improvement in the KCCQ overall summary score (mean, 5.21; 95% confidence interval, 4.42 to 6.00) compared with usual care alone (3.28; 95% confidence interval, 2.48 to 4.09). Question for ChatGPT ... does the Evidence support the above Claim?

Yes, the evidence provided supports the claim that prescribed exercise training improves the quality of life. The study found that at 3 months, patients who received usual care plus exercise training showed greater improvement in the KCCQ overall summary score compared to those who received usual care alone. This suggests that exercise training has a positive impact on patients' quality of life. Additionally, the study provides a confidence interval for the improvement in the KCCQ score, which further supports the conclusion that the improvement is statistically significant.

Claim: Patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anaemia. Evidence: The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA. Question for ChatGPT ... does the above Evidence support the Claim?

No, the evidence provided does not support the claim that patients with microcytosis and higher erythrocyte count are more vulnerable to severe malarial anemia. The evidence actually suggests the opposite, stating that the increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassemia may contribute substantially to their protection against severe malarial anemia (SMA). Therefore, the evidence contradicts the claim.
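
If anyone wants to reproduce this beyond a two-example spot check, here is a minimal sketch using the OpenAI Python client; the model name and prompt wording are my own illustrative choices, not anything prescribed by the market's resolution criteria:

```python
# Minimal sketch: classify SciFact-style claim/evidence pairs with an LLM.
# Assumes the `openai` package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def check_claim(claim: str, evidence: str) -> str:
    """Ask the model whether the evidence supports or refutes the claim."""
    prompt = (
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        "Does the Evidence support the Claim? "
        "Answer with exactly one word: SUPPORT or REFUTE."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice; any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output as deterministic as possible
    )
    return response.choices[0].message.content.strip()

print(check_claim(
    "Prescribed exercise training improves quality of life.",
    "At 3 months, usual care plus exercise training led to greater improvement "
    "in the KCCQ overall summary score compared with usual care alone.",
))  # expected: SUPPORT
```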

ShadowyZephyr

@PatrickDelaney LLaMA is not better than ChatGPT. In fact, it's not even close to ChatGPT. Maybe you mean something like GPT-4?

Patrick Delaney is predicting YES at 39%

@ShadowyZephyr Define "not even close." Which benchmarks, and by what margins? GPT-3.5 and GPT-4 are effectively analogous to ChatGPT, but we could discuss any of them.

Patrick Delaney
Comment hidden
firstuserhere bought Ṁ100 of NO

@PatrickDelaney yeah, so I bet 50 mana, which brought the market down to 10% because it was new (I didn't know that you had just created it, haha), so I sold, bringing it back to 35%.


Mark Ingraham

@PatrickDelaney the reason is that everyone realized the AI predictions always fail, so they are copying me and voting NO on everything. Try it: literally any AI post will immediately get 5 downvotes. It's not like I have bots or anything; this is real users fighting back against AI trolling.

Messi

@MarkIngraham AI taught me to speak English. I knew only Spanish before learning English. That made me happy and able to do things like Manifold Markets. My YouTube recommendations are so good because of AI. You are opinionated and should not be so close-minded. Regardless of whatever large-scale impact AI has (or fails to have, if it's just a fad as you say), it has made my life better, and that doesn't change regardless of anything else.

Mark Ingraham

@Messi English isn't a real language; it's Spanish with stuff removed.

Messi

@MarkIngraham You are flat-out wrong. English and Spanish have different origins: Spanish evolved from Latin roots, while English has its roots in Anglo-Saxon, with a Germanic evolution.

Mark Ingraham

@Messi that's all fake and it's Latin

Messi

@MarkIngraham If that logic is followed, Mark, then you're fake because you're just your mum, and your mum's fake because she's just her mum, and so on. And if English isn't a real language and is meaningless, then so is everything you say (which is nonsense regardless of what language you vomit in).

Chopping off your balls doesn't make you a girl, try as you may, and there are differences between languages that maybe a pea-sized brain may not be able to grasp. It's OK, I have pity for you.

Mark Ingraham

@Messi English is how humans speak, all other languages are pointless and are going extinct.

firstuserhere is predicting NO at 35%

@MarkIngraham That's false; not even 30% of the world speaks English primarily. If English is the only language you're able to understand, then, in English: is this all you're capable of? Trolling on every other market with blatantly wrong info? Don't reply to this comment, man. I've heard decent arguments from you as well as bad af takes. Just pause and take a look at what you're doing. Stop it, get some help.

Mark Ingraham

@firstuserhere Your information is wrong, and all other languages are going extinct.

PseudonymousAlt

@MarkIngraham feel free to make a prediction market about it.

John Smith

@Dreamingpast Mark’s what some experts refer to as an internet shithead. Disregard his verbal diarrhea.

Patrick Delaney

@MarkIngraham please refrain from trolling and toxic behavior or I will block you from all of my markets.

John Smith

@PatrickDelaney He was comment banned earlier today.

firstuserhere is predicting NO at 35%

@JohnSmithb9be yep. @PatrickDelaney mind if we collapse this thread and start another comment thread on the discussion that was gonna happen above?

Patrick Delaney

@JohnSmithb9be OK, thank you. By the way, you may be right to call him a sh*thead or whatever, but please keep it light if you don't mind, even with these folks. I understand it's frustrating when people comment like this; I just really don't want flame wars, or to antagonize potentially unstable people further. I think it's OK to be aggressive and try to win the bet, though; I don't want to stop you from that. Does that make sense or no?

Patrick Delaney

@firstuserhere Yes, how do I do that? Just start a new one and ignore this one?

firstuserhere is predicting NO at 35%

@JohnSmithb9be yep. Otherwise everyone in this comment chain will keep getting notifs for that

Related markets

Will A.I. Be Able to Make Significantly Better, "Common Sense Judgements About What Happens Next," by the End of 2023? (54%)
Will A.I. Become Significantly Better at Drug Discovery in 2023? (49%)
Will A.I. Get Significantly Better At, "Community Based Ethical Judgements," by the End of 2023? (77%)
Will A.I. Achieve Significantly More, "Linguistic Temporal Understanding" by end of 2023? (29%)
Will A.I. Achieve Significantly Higher Performance Over "General Conceptual Skills" in 2023? (66%)
Will A.I. Be Significantly Better, "Able to Track Changes in State," By the End of 2023? (8%)
Will A.I. Be Able to, "Feel and React to Pain," Significantly Better By the End of 2024? (34%)
Will A.I., "Hallucinate Significantly Less," by the End of 2023? (50%)
Will A.I. Be Significantly Better at, "Egocentric Navigation," by the End of 2023? (41%)
Will A.I. Be Able to Meet Just Below Human Performance In Being Able to "Track Changes in State," By the End of 2023? (27%)
Will AI Be Able to Understand the, "Meaning" of Questions Significantly Better By the End of 2023? (62%)
Will A.I. Get, "Scarily Better at Helping To Resolve Shipping Supply Chain Issues," by the End of 2024? (Re:HuggingFace) (64%)
Will an AI be able to convert recent mathematical results into fully formal proofs that can be verified by a mainstream proof assistant by 2025? (45%)
Will an AI produce encyclopedia-worthy philosophy by 2026? (37%)
Will AI fully automate Cochrane-style systematic reviews by end of 2024? (14%)
Will AI win a Pulitzer by 2025? (7%)
Will I consider myself a full-time AI alignment researcher by the end of 2023? (90%)
Will AI Impacts publish another Expert Survey on Progress in AI by the end of 2025? (36%)
Will anyone very famous claim to have made an important life decision because an AI suggested it by the end of 2023? (42%)
AI Surgery by 2025? (10%)