
Procedure for evaluation:
Draw a truly random sample of 100 documents from the Pile
Establish the authorship of each sampled document where possible
Release the numbers
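A rough sketch of what "truly random" could look like in practice, assuming the documents can be streamed one at a time from some iterable. The function name `reservoir_sample` and the fixed seed are my own illustration, not part of any Pile tooling. It uses reservoir sampling, so every document has the same chance of being picked without loading the whole dataset into memory:

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration; `items` stands in for a stream of
    documents, and `seed` is fixed only so the sample can be reproduced.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Usage: placeholder document ids standing in for real documents.
doc_ids = (f"doc-{n}" for n in range(10_000))
picked = reservoir_sample(doc_ids, k=100, seed=42)
```

Publishing the seed along with the numbers would let anyone else redraw the same 100 documents and check the authorship calls.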
I used a (poorly randomized) sample of a few documents from the Pile and got (note: for some of these, gender was guessed from the name):
unknown 9
male 8
female 2 or 3 (two confirmed; one has a name that seems female, but I think they were married to a woman, so probably male). Two wrote a science grant/paper, one is a fashion blogger
@jacksonpolack So of 20 authors, 18 are "plausibly written by men"? That counts as plausible all of men + unknown + "married to a woman so probably male".
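To make the arithmetic explicit, here is a small sketch using the counts quoted above. The dictionary keys and the function name `male_share_bounds` are made up for illustration. It shows how far apart the lower bound (confirmed men only) and the upper bound (men + unknown + the ambiguous case) are:

```python
# Counts as reported in the thread; the labels are my own, for illustration.
counts = {
    "unknown": 9,
    "male": 8,
    "female_confirmed": 2,
    "ambiguous_probably_male": 1,
}


def male_share_bounds(counts):
    """Return (low, high) share of male authors under the two extreme readings."""
    total = sum(counts.values())
    # Lower bound: only authors confirmed as male.
    low = counts["male"] / total
    # Upper bound: everyone not confirmed female is counted as plausibly male.
    plausibly_male = (
        counts["male"] + counts["unknown"] + counts["ambiguous_probably_male"]
    )
    high = plausibly_male / total
    return low, high


low, high = male_share_bounds(counts)
print(f"male share is somewhere between {low:.0%} and {high:.0%}")
```

With nearly half the authors unknown, the "18 of 20" figure is the most generous reading rather than a measurement, which is why the threshold for "plausible" matters.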
How plausible does the gender have to be?
How are we counting jointly authored documents?
What makes you think the Pile is roughly GPT-4 data?
Pile-CC is something like 1% vile material and is heavily duplicated. I would be somewhat surprised if these issues haven't been addressed.

