Is more than 90% of the Pile Dataset (~training data for GPT4) plausibly written by men?
closes May 31

Procedure for evaluation:

  • Truly random sample of 100 documents from Pile

  • Establish authorship of sample if possible

  • Release numbers
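
One way the "truly random sample" step might be done is sketched below. It uses reservoir sampling, so the corpus never has to fit in memory. This is only a sketch under assumptions: `reservoir_sample`, the fixed seed, and the stand-in stream of document IDs are hypothetical and not part of the market's stated procedure.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Uniformly sample k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical usage: a stand-in stream of document IDs, not the real Pile.
document_ids = range(1_000_000)
chosen = reservoir_sample(document_ids, k=100, seed=2023)
```

Publishing the seed along with the sample would let others check that the 100 documents were not hand-picked.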

jackson polack is predicting NO at 16%

I used a (poorly randomized) sample of a few documents from the Pile and got the following (note: for some of these, gender was guessed from the name):
unknown: 9
male: 8
female: 2 or 3 (two confirmed; one has a name that seems female, but I think they were married to a woman, so probably male). Of these three, two are sci grants/papers and one is a fashion blogger.

Martin Randall

@jacksonpolack So of 20 authors, 18 are "plausibly written by men"? That counts as plausible all of the men, plus the unknowns, plus the one "married to a woman so probably male".

How plausible does the gender have to be?

How are we counting jointly authored documents?
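
The tally in this thread can be checked with exact arithmetic. The sketch below is illustrative only: the function name and the two counting rules are assumptions, and the counts are the ones quoted above (9 unknown, 8 male, 2 confirmed female, 1 "probably male").

```python
from fractions import Fraction


def share_plausibly_male(male: int, probably_male: int, female: int,
                         unknown: int, count_unknown: bool) -> Fraction:
    """Share of authors that are plausibly male under a given counting rule.

    If count_unknown is True, unknown authors are treated as plausibly male.
    If False, they are dropped from both numerator and denominator.
    """
    extra = unknown if count_unknown else 0
    plausible = male + probably_male + extra
    total = male + probably_male + female + extra
    return Fraction(plausible, total)


# Counts taken from the informal sample quoted in the thread.
lenient = share_plausibly_male(male=8, probably_male=1, female=2,
                               unknown=9, count_unknown=True)
strict = share_plausibly_male(male=8, probably_male=1, female=2,
                              unknown=9, count_unknown=False)

# The market asks about "more than 90%", so the comparison is strict.
threshold = Fraction(9, 10)
lenient_resolves_yes = lenient > threshold
strict_resolves_yes = strict > threshold
```

Under the lenient rule the share is 18/20, which is exactly 90% and so does not exceed the threshold; dropping the unknowns gives 9/11. With a sample this small, the answer depends heavily on which rule is used.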

Arthur Conmy bought Ṁ10 of YES

What makes you think the Pile is roughly GPT-4 data?

Pile-CC is something like 1% vile material and contains a lot of duplication. I would be somewhat surprised if these issues haven't been addressed.

