By 2025 will there be a competitive large language model with >50% of the total training data generated from a large language model?
resolved Mar 14
Resolved
NO

Large meaning >= 20b parameters.
Competitive meaning the benchmark results are close to, or better than, those of a model trained only on human text.

Computer-generated text does not count on its own; it has to be the output of a language model. For example, converting code to a compiler's intermediate representation and training only on that would not count.

Processed text is valid, as long as it's sourced from the language model.

Multiple stages of training are fine. For example, even if there is an initial training period on only human text, this will resolve YES as long as AI-generated training examples make up >50% of the total training examples across the entire run.
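The aggregation rule above can be sketched in a few lines. This is a minimal illustration with made-up token counts (the stage names and numbers are hypothetical, not from any real model): the synthetic fraction is computed over all stages combined, not per stage.

```python
# Hypothetical multi-stage training run; all numbers are invented for illustration.
stages = [
    {"name": "pretraining on human text", "human_tokens": 300e9, "synthetic_tokens": 0},
    {"name": "continued pretraining on LLM output", "human_tokens": 50e9, "synthetic_tokens": 400e9},
]

# Pool examples across every stage before computing the fraction.
human = sum(s["human_tokens"] for s in stages)
synthetic = sum(s["synthetic_tokens"] for s in stages)
fraction = synthetic / (human + synthetic)

print(f"synthetic fraction: {fraction:.1%}")  # 53.3% in this made-up example
print("would resolve YES" if fraction > 0.5 else "would resolve NO")
```

Here the first stage is 100% human text, but the run as a whole is majority synthetic, so under the stated rule it would count toward YES.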

Market resolves on March 1st, 2025, to account for announcements of models trained in the second half of 2024.


@dmayhem93 do you have stats on this from AIs known to be trained partly on model-generated text?

@MartinRandall Phi-3.5 does not give a percentage of synthetic data in its report, and that's the only one I'm aware of that has a slight chance of being considered here.

A model extraction attack is enough to resolve this yes, right? Or any kind of distillation process where we train a model and use its output to train a model?

Does Constitutional AI count?

predicted YES

@MartinRandall It would have to be trained on text, not on logits. So models like Alpaca and friends are fine if they scaled it up to 750b tokens, but a traditional student/teacher distillation setup is not.

Constitutional AI would count, yeah, if it made up >50% of the total.
