Is more than 90% of the Pile Dataset (~training data for GPT4) plausibly written by men?
closes May 31

Procedure for evaluation:

  • Truly random sample of 100 documents from Pile

  • Establish authorship of sample if possible

  • Release numbers
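One way the sampling step above could be sketched: reservoir sampling draws a uniform sample of 100 documents from a stream without knowing its length in advance, which suits a dataset too large to load at once. This is only an illustration under assumptions. The function name `reservoir_sample`, the `seed` parameter, and the idea that the Pile is read as an iterable of documents are all hypothetical and not part of the market's stated procedure.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k items from a stream (Algorithm R).

    Hypothetical helper: `items` stands in for an iterator over Pile documents.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Stand-in for document IDs; a real run would iterate over the dataset itself.
doc_ids = range(10_000)
chosen = reservoir_sample(doc_ids, k=100, seed=42)
```

Publishing the seed alongside the results would let others re-draw the same 100 documents and check the authorship labels.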

jackson polack is predicting NO at 16%

I used a (poorly randomized) sample of a few documents from the Pile and got the following (note: for some of these, gender was guessed from the name):
unknown: 9
male: 8
female: 2 or 3 (two confirmed; one has a name that seems female, but I think they were married to a woman, so probably male). Two of these were science grants/papers and one was a fashion blogger.

Martin Randall

@jacksonpolack so 20 authors, 18 are "plausibly written by men"? Including as plausible all of men + unknown + "married to a woman so probably male".

How plausible does the gender have to be?

How are we counting jointly authored documents?
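To make the dependence on the plausibility rule concrete, here is a rough tally using the counts quoted in this thread (8 male, 9 unknown, 2 confirmed female, 1 ambiguous). The category names and the function `plausibly_male_share` are made up for illustration, and jointly authored documents are not handled at all; each document is treated as having one author.

```python
from fractions import Fraction

# Counts taken from the informal sample reported above; labels are hypothetical.
counts = {"male": 8, "unknown": 9, "female": 2, "ambiguous": 1}


def plausibly_male_share(counts: dict, plausible: set) -> Fraction:
    """Share of documents whose category is treated as 'plausibly male'."""
    total = sum(counts.values())
    return Fraction(sum(counts[c] for c in plausible), total)


# Lenient rule: unknown and ambiguous authors all count as plausibly male.
lenient = plausibly_male_share(counts, {"male", "unknown", "ambiguous"})

# Strict rule: only authors identified as male count.
strict = plausibly_male_share(counts, {"male"})

threshold = Fraction(9, 10)
print(lenient, lenient > threshold)  # 9/10 False
print(strict, strict > threshold)    # 2/5 False
```

Under the lenient rule the share is exactly 90%, which does not exceed the "more than 90%" threshold in the question, so with a sample this small the outcome hinges on how a single document is classified.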

Arthur Conmy bought Ṁ10 of YES

What makes you think the Pile is roughly GPT-4 data?

Pile-CC is something like 1% vile material and contains a lot of duplication. I would be somewhat surprised if these issues haven't been addressed.
