Is more than 90% of the Pile Dataset (~training data for GPT4) plausibly written by men?
resolved Jun 12

Procedure for evaluation:

  • Truly random sample of 100 documents from Pile

  • Establish authorship of sample if possible

  • Release numbers
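A minimal sketch of how the first and last steps could be done, assuming the documents can be streamed one at a time. The helper names `reservoir_sample` and `tally` are hypothetical, and establishing authorship is still a manual step that this code does not attempt.

```python
import random
from collections import Counter

def reservoir_sample(docs, k=100, seed=None):
    """Uniform random sample of k items from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, doc in enumerate(docs):
        if i < k:
            # Fill the reservoir with the first k documents.
            sample.append(doc)
        else:
            # Keep the new document with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = doc
    return sample

def tally(labels):
    """Count hand-assigned labels such as 'male', 'female', 'unknown'."""
    return Counter(labels)
```

Reservoir sampling is used here only because it gives every document the same chance of selection without knowing the dataset size up front; any other uniform sampling method would serve equally well.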

I used a (poorly randomized) sample of about 20 documents from the Pile and got the following (note: for some of these, gender was guessed from the name):
unknown 9
male 8
female 2 or 3 (two confirmed; one has a name that seems female, but I think they were married to a woman, so probably male) - two from science grants/papers, one fashion blogger

@jacksonpolack so 20 authors, 18 are "plausibly written by men"? Including as plausible all of men + unknown + "married to a woman so probably male".
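To make the arithmetic explicit, here is a rough sketch using the counts reported above (9 unknown, 8 male, 2 confirmed female, 1 ambiguous). The three readings of "plausibly written by men" are illustrative assumptions, not the market's official criteria.

```python
from fractions import Fraction

# Counts from the informal sample; the ambiguous author is kept separate.
unknown, male, female_confirmed, ambiguous = 9, 8, 2, 1
total = unknown + male + female_confirmed + ambiguous  # 20

# Most generous reading: everything not confirmed female counts as plausibly male.
generous = Fraction(male + unknown + ambiguous, total)
# Strictest reading: only authors identified as male count.
strict = Fraction(male, total)
# Ignoring the unknowns and the ambiguous case entirely.
known_only = Fraction(male, male + female_confirmed)

for name, share in [("generous", generous), ("strict", strict), ("known only", known_only)]:
    print(f"{name}: {float(share):.0%}")
```

Under the generous reading the share is exactly 18/20, so whether this sample supports "more than 90%" depends on how the threshold and the unknowns are treated.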

How plausible does the gender have to be?

How are we counting jointly authored documents?

I'm assuming OP will interpret their criteria in a good-faith way lol

How are we counting jointly authored documents?

I just counted the 'first author' of papers and the named author of blog posts. I counted two Wikipedia articles primarily authored by men as 'written by men', and none for women. That is debatable, but dropping those two can only increase the estimated female share. Counting multiple authors per document also doesn't seem within the spirit of the market.

What makes you think the Pile is roughly GPT-4 data?

Pile-CC is something like 1% vile material and is heavily duplicated. I would be somewhat surprised if these issues haven't been addressed.

