Will more than 5% of GPT-4’s training data be YouTube transcripts?
Closes 2024. 22% chance.

If an estimate of the composition of GPT-4's training data becomes available, this market will resolve to YES if more than 5% of that data consists of YouTube transcripts. If GPT-4 ends up being multimodal, raw YouTube videos don't count towards the resolution.

Arcgis (@tftftftftftftftftftftftf)

Useful: https://arxiv.org/abs/2101.00027 includes YouTube transcripts.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale la…
Bionic (@BionicD0LPH1N) is predicting YES at 25%

@tftftftftftftftftftftftf True, but only 0.6% of the dataset.

