Will DALLE-3 create correct text for names in English, in images?
resolved Jan 2
Resolved
N/A

I will run a prompt: "The name "*" 3d render" where * will be replaced with each name from the following:

Joshua

Marcus

Austin

Brian

Isaac

Victor

Chris

Phil

Martin

Alice

Bob

Eliza

Dylan

Nicole

(these are 14 names)

For each name, I will try 10 times.

If DALLE-3 gets the name correct 6 or more times out of 10, then DALLE-3 gets 1 point for that name.

If it gets it correct 5 times out of 10, it gets 0.5 points.

For 0-4 correct out of 10, it gets 0 points.

If DALLE-3 achieves a score of >7 out of 14, this resolves YES. Otherwise NO.
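The scoring rule above can be written out as a small sketch. The function names `points_for_name` and `resolves_yes` are made up for illustration and are not part of the market; the thresholds are the ones stated in the description.

```python
def points_for_name(correct: int) -> float:
    """Points for one name, given how many of its 10 tries were correct."""
    if not 0 <= correct <= 10:
        raise ValueError("correct must be between 0 and 10")
    if correct >= 6:
        return 1.0
    if correct == 5:
        return 0.5
    return 0.0


def resolves_yes(correct_by_name: dict[str, int]) -> bool:
    """True if the total score is strictly greater than 7 (out of 14 names)."""
    total = sum(points_for_name(c) for c in correct_by_name.values())
    return total > 7
```

Note that the threshold is strict: a total of exactly 7 points resolves NO under this reading.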


Aw, why'd this N/A?

I plan to resolve this in the next 2 days

predictedNO

If anyone is interested in getting my shares for 21% with a limit order, lmk

predictedNO

(price negotiable since it’s moving around)

They seem less likely to improve their models til the end of the year now..

I tried all names twice with the current version of DALLE-3 on the chat.openai.com website. First pass over all names: https://chat.openai.com/c/adcae3c1-5a5c-4276-97ad-3f13f79a495a. Second pass over all names: https://chat.openai.com/c/50b89107-0afd-448e-9044-fd419e1b4329


The format is “[name] [try 1, success fail or meh (in between)] [try 2, same thing]”. I split them by ; because otherwise the comment is really long.


Joshua fail fail; Marcus fail fail; Austin fail success; Brian fail fail; Isaac fail fail; Victor fail fail; Chris fail fail; Phil fail fail; Martin success fail; Alice meh(probably success) success; Bob success success; Eliza meh(probably fail) fail; Dylan meh (probably success) fail; Nicole fail meh(probably fail)


For the ones I rated meh, you can click on the link to the tests and see for yourself if you think I'm harsh / not harsh enough.

If I naively duplicate each image 5 times, to simulate (with more variance!) the actual test, we get 4.5 points if the mehs are all successes, and 2.5 points if the mehs are all failures. Therefore, with the current DALLE-3 model, it seems pretty likely (if I’m not missing something significant) that this market resolves NO. Looking at the generated images, longer names are pretty far from being correct, but shorter ones (especially Bob and Alice) are pretty consistent and successful.

It is possible that:

  • my prompting is suboptimal or misguided in some way

  • the DALLE-3 model used on the ChatGPT website is different from the API one (if it’s different it’ll probably be worse)

  • DALLE-3 improves significantly before EOY

  • 2 samples per name were too few, especially since that biases toward 0.5-point scores, which are unlikely in the full test. Only 7 names got 2/2 fails.
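For anyone who wants to check the arithmetic, here is a minimal sketch of the naive extrapolation from the two-tries-per-name results listed earlier in this comment. The encoding (S = success, F = fail, M = meh) and the names `TWO_TRIES`, `points` and `extrapolated_score` are my own illustration; each observed image is counted 5 times to stand in for 10 tries.

```python
# Two observed tries per name: S = success, F = fail, M = meh.
TWO_TRIES = {
    "Joshua": "FF", "Marcus": "FF", "Austin": "FS", "Brian": "FF",
    "Isaac": "FF", "Victor": "FF", "Chris": "FF", "Phil": "FF",
    "Martin": "SF", "Alice": "MS", "Bob": "SS", "Eliza": "MF",
    "Dylan": "MF", "Nicole": "FM",
}


def points(correct_out_of_10: int) -> float:
    """Market scoring rule: 6+ correct gives 1, exactly 5 gives 0.5, else 0."""
    if correct_out_of_10 >= 6:
        return 1.0
    if correct_out_of_10 == 5:
        return 0.5
    return 0.0


def extrapolated_score(results: dict[str, str], meh_is_success: bool) -> float:
    """Count each observed image 5 times, then apply the scoring rule."""
    good = "SM" if meh_is_success else "S"
    return sum(
        points(5 * sum(ch in good for ch in tries))
        for tries in results.values()
    )


print(extrapolated_score(TWO_TRIES, meh_is_success=True))   # 4.5
print(extrapolated_score(TWO_TRIES, meh_is_success=False))  # 2.5
```

With only two samples, every name can only land on 0, 5 or 10 out of 10, which is why so many names end up at 0.5 points here.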

I've run additional tests, in total they are here (includes previous ones):
https://chat.openai.com/c/adcae3c1-5a5c-4276-97ad-3f13f79a495a

https://chat.openai.com/c/50b89107-0afd-448e-9044-fd419e1b4329

https://chat.openai.com/c/220a664d-7864-413b-8e66-62a54f764510

https://chat.openai.com/c/4a6ec368-a28b-450f-9adb-c5316ee69406

https://chat.openai.com/c/f1d9b6cc-fd3f-4065-88f5-7f1613a58d7d

And the results (row is a name, column is the nth try for that name, 5 total so far; 1 is success, 0 is failure, M is maybe, i.e. not sure which it is):
Joshua: 00010
Marcus: 00010
Austin: 01000
Brian: 00000
Isaac: 00000
Victor: 00M00
Chris: 00000
Phil: 00000
Martin: 10000
Alice: M1100
Bob: 11011
Eliza: M0M00
Dylan: M0011
Nicole: 0MM0M

If the M's are 1, this results in 4 points. If the M's are 0, it results in 1 point.
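Those two totals can be reproduced with a short sketch, assuming each name's 5 tries are doubled to stand in for the 10 tries in the market description (that doubling is my assumption, as are the names `FIVE_TRIES`, `points` and `projected_score`).

```python
# Five observed tries per name: "1" = success, "0" = failure, "M" = maybe.
FIVE_TRIES = {
    "Joshua": "00010", "Marcus": "00010", "Austin": "01000",
    "Brian": "00000", "Isaac": "00000", "Victor": "00M00",
    "Chris": "00000", "Phil": "00000", "Martin": "10000",
    "Alice": "M1100", "Bob": "11011", "Eliza": "M0M00",
    "Dylan": "M0011", "Nicole": "0MM0M",
}


def points(correct_out_of_10: int) -> float:
    """Market scoring rule: 6+ correct gives 1, exactly 5 gives 0.5, else 0."""
    if correct_out_of_10 >= 6:
        return 1.0
    if correct_out_of_10 == 5:
        return 0.5
    return 0.0


def projected_score(results: dict[str, str], m_counts_as: str) -> float:
    """Resolve every M to '1' or '0', double the 5 tries to 10, then score."""
    total = 0.0
    for tries in results.values():
        correct = tries.replace("M", m_counts_as).count("1")
        total += points(correct * 2)  # 5 tries scaled up to 10
    return total


print(projected_score(FIVE_TRIES, m_counts_as="1"))  # 4.0
print(projected_score(FIVE_TRIES, m_counts_as="0"))  # 1.0
```

Under the optimistic reading the points come from Alice, Bob, Dylan and Nicole; under the pessimistic reading only Bob scores.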

predictedYES

@TheBayesian interesting. I tested this with Bing and that version did manage to get it consistently enough

predictedNO

@Shump I’m quite concerned by the fact that the resolution date got extended and they might improve the model.. or bing does something different and better somehow, huh

@TheBayesian That's right. I should've set a deadline when I created the market.

I remember doing some testing earlier, and DALLE-3 was not creating text very well. However, just now I did a pass over all these names, and the model seems to have improved quite a lot: it gets most of the names correct most of the time.

In consideration of that, I'm going to cancel this market contract, as the traders bet on a weaker version of the model than what is being tested, and this should've been an explicitly clarified condition in the market.
