I will run the prompt "The name '*' 3d render", where * will be replaced with each name from the following:
Joshua
Marcus
Austin
Brian
Isaac
Victor
Chris
Phil
Martin
Alice
Bob
Eliza
Dylan
Nicole
(these are 14 names)
For each name, I will try 10 times.
If DALLE-3 gets the name correct 6 or more times out of 10, then DALLE-3 gets 1 point for that name.
If it gets it correct 5 times out of 10, it gets 0.5 points.
For 0-4 correct out of 10, it gets 0 points.
If DALLE-3 achieves a score of >7 out of 14, this resolves YES. Otherwise NO.
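The scoring rule above can be sketched as a small helper (the function names are mine, not part of the market):

```python
def points_for_name(successes: int) -> float:
    """Score one name: 1 point for >=6/10 correct, 0.5 for exactly 5/10, else 0."""
    if successes >= 6:
        return 1.0
    if successes == 5:
        return 0.5
    return 0.0

def resolves_yes(points_per_name: list[float]) -> bool:
    """Market resolves YES only if the total across the 14 names exceeds 7."""
    return sum(points_per_name) > 7
```

Note that 8 names at a full point is enough (8 > 7), but 14 names at half a point is not (7.0 is not > 7).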
I tried all names twice with the current version of DALLE-3 on the chat.openai.com website. First pass over all names: https://chat.openai.com/c/adcae3c1-5a5c-4276-97ad-3f13f79a495a. Second pass over all names: https://chat.openai.com/c/50b89107-0afd-448e-9044-fd419e1b4329
The format is "[name] [try 1: success, fail, or meh (in between)] [try 2: same]". I split the entries with ";" because otherwise the comment would be really long.
Joshua fail fail;
Marcus fail fail;
Austin fail success;
Brian fail fail;
Isaac fail fail;
Victor fail fail;
Chris fail fail;
Phil fail fail;
Martin success fail;
Alice meh (probably success) success;
Bob success success;
Eliza meh (probably fail) fail;
Dylan meh (probably success) fail;
Nicole fail meh (probably fail)
For the ones I rated meh, you can click on the link to the tests and see for yourself if you think I'm harsh / not harsh enough.
If I naively duplicate each result 5 times to simulate the actual test (with more variance!), we get 4.5 points if the mehs are all successes, and 2.5 points if the mehs are all failures. Therefore, with the current DALLE-3 model, it seems pretty likely (unless I'm missing something significant) that this market resolves NO. Looking at the generated images, longer names are pretty far from being correct, but shorter ones (especially Bob and Alice) are rendered consistently and successfully.
It is possible that:
my prompting is suboptimal or misguided in some way
the DALLE-3 model used on the ChatGPT website is different from the API one (if it's different, it's probably worse)
DALLE-3 improves significantly before EOY
2 samples per name was too little, especially since it biases toward 0.5-point outcomes, which are unlikely in the full 10-try test. (Only) 7 names got 2/2 fails.
I've run additional tests; in total they are here (including the previous ones):
https://chat.openai.com/c/adcae3c1-5a5c-4276-97ad-3f13f79a495a
https://chat.openai.com/c/50b89107-0afd-448e-9044-fd419e1b4329
https://chat.openai.com/c/220a664d-7864-413b-8e66-62a54f764510
https://chat.openai.com/c/4a6ec368-a28b-450f-9adb-c5316ee69406
https://chat.openai.com/c/f1d9b6cc-fd3f-4065-88f5-7f1613a58d7d
And the results (each row is a name, each column is the nth try for that name, 5 total so far; 1 is success, 0 is failure, M is maybe (not sure which it is)):
Joshua: 00010
Marcus: 00010
Austin: 01000
Brian: 00000
Isaac: 00000
Victor: 00M00
Chris: 00000
Phil: 00000
Martin: 10000
Alice: M1100
Bob: 11011
Eliza: M0M00
Dylan: M0011
Nicole: 0MM0M
If the M's are all 1, this results in 4 points; if the M's are all 0, it results in 1 point.
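As a sanity check on those bounds, here is a sketch that transcribes the table above and applies the market's scoring rule after naively doubling each 5-try row to 10 tries (the dict layout and function names are mine):

```python
# Each string is one name's 5 tries: 1 = success, 0 = failure, M = maybe.
results = {
    "Joshua": "00010", "Marcus": "00010", "Austin": "01000",
    "Brian": "00000", "Isaac": "00000", "Victor": "00M00",
    "Chris": "00000", "Phil": "00000", "Martin": "10000",
    "Alice": "M1100", "Bob": "11011", "Eliza": "M0M00",
    "Dylan": "M0011", "Nicole": "0MM0M",
}

def total_points(results: dict[str, str], m_value: str) -> float:
    """Resolve every M to m_value ('1' or '0'), double each 5-try row to
    simulate 10 tries, then apply the rule: >=6/10 -> 1 pt, 5/10 -> 0.5, else 0."""
    total = 0.0
    for row in results.values():
        successes = 2 * row.replace("M", m_value).count("1")
        if successes >= 6:
            total += 1.0
        elif successes == 5:
            total += 0.5
    return total

print(total_points(results, "1"))  # upper bound (all M's succeed): 4.0
print(total_points(results, "0"))  # lower bound (all M's fail): 1.0
```

Both bounds match the comment: the upper bound comes from Alice, Bob, Dylan, and Nicole each reaching 3/5, while the lower bound is Bob's 4/5 alone.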
@TheBayesian interesting. I tested this with Bing and that version did manage to get it consistently enough
@Shump I'm quite concerned by the fact that the resolution date got extended and they might improve the model... or Bing does something different and better somehow, huh
@TheBayesian That's right. I should've set a deadline when I created the market.
I remember doing some testing and DALLE-3 was not rendering text very well. However, just now I did a pass over all these names, and it seems the model has improved quite a lot: it gets most of the names correct most of the time.
In light of that, I'm going to cancel this market, as traders bet on a weaker version of the model than the one being tested, and this should've been an explicitly clarified condition in the market.