Can my friend tell GPT-4 from respected bloggers? (Lady tasting tea experiment)
Basic · 29 · Ṁ15k
Resolved NO (May 10)

The rules, according to my friend: "You pick posts from a few respected blogs from before 2022 (no cherry-picking) and also generate GPT-4 blogs (cherry-picking allowed). Say 4 of each. I bet I can pick out which are which given a decent length sample."

Resolves YES if he can tell them apart correctly (at the 1.4% significance level, which requires labelling all 8 correctly); NO if the null hypothesis is not rejected.

predicted YES

Damn. I was decently confident I could tell the difference but did no better than chance. I do notice a few clues in retrospect but some of GPT-4’s generations were shockingly good. (And some human posts were a little underwhelming.)

predicted YES

What were the GPT-4 posts and what were the real posts?

predicted YES

not getting my IP

predicted YES

I don't see where it says which were GPT-4 or real. Does Manifold have spoiler tags? It'd be fun for us to post our guesses.

predicted YES

I just blind guessed and got all of them right (confirmed with Google).

predicted YES

(obviously within the last 20 minutes, not within the last two minutes)

Don't post the answers without a spoiler, so other people can do it too.

Also, how many was he off by?

predicted NO

@jacksonpolack answers (obviously spoils the whole thing): https://pastebin.com/3zVtkjFs

predicted YES

@jacksonpolack Two right and two wrong. So no better than random chance 😔

predicted YES

!!Indirect spoilers!!: Even having read them, I'm still a bit surprised you were tricked; the writing styles are very clearly different. The GPT-4 ones flow very smoothly (not in a good way), with lots of connecting clauses, unnecessary topic sentences, and the like, whereas the real ones have more awkward phrasings and points that are at least a little hard to understand, because the author cares more about communicating something than about making the words flow. The non-GPT-4 blogs are the densest: they have interesting ideas and pack more content into their sentences and paragraphs. I didn't pick up at all on who GPT-4 was supposed to be imitating. The least generic GPT-4 post was <censored>, but precisely because it was less generic, it made several meaningful mistakes.

I wonder how much of that is whatever fine-tuning/RLHF/... OpenAI did on top of the base model, though, especially the writing-style parts.

If possible, could you give an example of a respected blog post that may or may not be used?

Are the blogs you are picking/generating about topics your friend has decent knowledge of (70th percentile of Western-educated people or above)?

For 8 trials, to get p<0.05 you'd need to get 7/8 correct. Based on that I think this is slightly high.

@DanMan314 oops, good point, I should've used the actual number, which is the 1.4% significance level. I've updated the description to match.

@AndrewG So it requires all 8/8?

@AndrewG 7/8 is not a possible guess, since they know it's 4 and 4?

predicted NO

@Odoacre Depends on the guesser's utility function. They could want to hedge even knowing it guarantees not getting them all right.
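
For anyone checking the arithmetic, here is a minimal sketch (assuming the guesser knows the 4-and-4 split, as the rules state) of why only a perfect 8/8 clears the 1.4% threshold:

```python
from math import comb

# Lady-tasting-tea setup: 8 posts, exactly 4 are GPT-4, and the guesser
# knows the split, so they label exactly 4 posts as GPT-4.
# If k of their 4 "GPT-4" labels are correct, the total number of
# correctly labelled posts is 2k (each wrong GPT-4 label forces a wrong
# human label too), so only 0/8, 2/8, 4/8, 6/8, and 8/8 are possible.
total = comb(8, 4)  # 70 equally likely labellings under the null
for k in range(5):
    p = comb(4, k) * comb(4, 4 - k) / total
    print(f"{2 * k}/8 correct: P = {p:.3f}")

# 8/8 correct has probability 1/70 ≈ 0.014, hence the 1.4% significance
# level; the next achievable score, 6/8, has P ≈ 0.229 on its own,
# so only a perfect 8/8 rejects the null (and 7/8 is not a possible score).
```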

Is/was the test using ChatGPT, or the GPT-4 API? I'm guessing the former, since it doesn't specify. I'd expect ChatGPT-4 to be worse at imitating blog posts (the "assistant" character tends to leak into the generated text) than the API-based model with a decent system prompt.

@ML GPT-4 API

@AndrewG Thank you!
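
For context on what a GPT-4 API call with a system prompt looks like, here is a minimal sketch using the pre-1.0 openai Python library (the model name is the only detail confirmed above; the prompt text, temperature, and everything else are illustrative assumptions, not what was actually used for this market):

```python
import openai

openai.api_key = "sk-..."  # set your own key

# Hypothetical prompt: ask the model to imitate a specific long-form blog
# without letting the "assistant" persona leak into the text.
response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.9,  # assumed sampling setting
    messages=[
        {
            "role": "system",
            "content": (
                "You are a ghostwriter imitating a specific long-form blog. "
                "Write in that author's voice; do not mention AI or assistants."
            ),
        },
        {
            "role": "user",
            "content": "Write a ~1,500-word blog post on <topic>.",
        },
    ],
)

print(response["choices"][0]["message"]["content"])
```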
