
To resolve YES, someone (possibly myself) must provide an image model and one or two prompts to test it with.
The image model is any program capable of generating arbitrary images. It can use any method to do so, but it must be general. An LLM that writes code to draw simple geometric shapes does not qualify.
If there's any question over whether a program qualifies, I'll require that it be able to generate the polygons combined with some other quality that current image models can already handle. Maybe the polygon has to be in a specific style, or a person is holding it, or whatever. The submitter can choose anything sufficient to prove this is a general image-generation program.
If the input is fed through an LLM or some other system before reaching the image model, I will bypass this pre-processing if I can easily do so; otherwise I will leave it in place.
For side counts 3-8, I will use the shape names from triangle to octagon. For side counts >8, I will enter the number of sides, either as digits or spelled out, submitter's choice. I'll test every number from 9-20, and 5 random numbers from 21-50.
The prompt can be anything, but it must be consistent: the same string of text every time, changing only the descriptor of the shape I want. (One prompt for 3-8 and a different prompt for >8 is fine.)
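The selection of side counts and prompt descriptors above can be sketched as follows; the exact descriptor wording is the submitter's choice, so the names and phrasing here are illustrative assumptions:

```python
import random

# Named shapes for 3-8 sides; larger counts are written as digits.
NAMES = {3: "triangle", 4: "square", 5: "pentagon",
         6: "hexagon", 7: "heptagon", 8: "octagon"}

def build_test_counts(rng=random):
    """Every side count from 3-20, plus 5 random counts from 21-50."""
    return list(range(3, 21)) + sorted(rng.sample(range(21, 51), 5))

def descriptor(n):
    """The shape descriptor substituted into the otherwise-fixed prompt."""
    return NAMES.get(n, f"{n}-sided regular polygon")
```

For example, `descriptor(5)` yields `"pentagon"` while `descriptor(49)` yields `"49-sided regular polygon"`, matching the one-prompt-for-3-8, different-prompt-for->8 split.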
For each attempt, at least 50% of the generated images must be unambiguously the specified shape. It's OK if there's other stuff in the picture, if the polygon is pictured at an angle, or if there are other distractions. But if there's any debate over whether the specified regular polygon actually appears somewhere in the image, it doesn't count. If the resolution is too low for me to tell, I will assume it's not the correct shape.
If any attempt fails, the entire test fails. It must pass for every side number I test.
@IsaacKing you've significantly changed the resolution criteria again. I'm never betting on one of your markets again, that's for sure.
@mods Take a look at the history and compare where this question started and where it is now. Is this kind of goalpost shifting really acceptable on Manifold? In particular, I think the new requirement that the polygon be embedded in a more general-purpose image is a significant departure from where the question was just a few revisions ago, and is very far from the original market.
@DanHomerick I'm not speaking in an official capacity here, but I think a particularly big issue is that the markets with ambiguous criteria and frequently changing unambiguous criteria attract the most traders because people see them and—often quite rationally—think the market is mispriced and correct it or take out large limit orders. So the worst markets end up being the most liquid and therefore most visible and most attractive, in a vicious feedback loop.
My reading of the comments and discussion is that a good-faith effort is being made to improve the criteria and clarify the question.
If I had one note, it would be that when an author isn't sure about a clarification, I'd prefer it be made while the market is closed, with time to discuss it first.
@DanHomerick I think the current text makes it more likely to resolve YES because it states that any program that can generate images with polygons and (e.g.) people qualifies. That includes a program that uses generated code to get the 49-sided shape pixel perfect in combination with generated images of people at 2024 levels of fidelity.
@IsaacKing Perhaps it means it would reach 100 because 192>100, and the properties of a 192-sided figure are similar to that of a 100-sided figure?
@jwith I believe it would have taken you less time to google the answer than to make this comment. :)
@MalachiteEagle It’s very low compared to the original resolution criteria, and in my opinion indicates a significant departure from the original version
@JimHays +1 on this being not an ambiguity but an explicit inconsistency with the original criteria! Concerning.
@IsaacKing wait what? I'm so confused by this change. I assumed it was just triangles through octagons? HUMANS can't even reliably produce images of a 49-sided polygon.
@bens This market isn’t benchmarked against human performance though, so I’m not sure why human performance would be relevant here?
Even sticking with named regular polygons you’ve got the chiliagon, myriagon, and megagon, which most people have never heard of, but which should all be necessary for a YES
@MalachiteEagle @bens Not sure why you guys had that impression, the market has always been very clearly about all polygons. The title says "every regular polygon", and the original description confirmed explicitly that those above 8 sides are included.
@JimHays @Jacy I added the 50-side cap because people were pointing out that I cannot test every positive integer, and thus need to test only a subset. There were questions about what this subset would be, so I formalized it now to avoid arguments about it later. (e.g. a NO bettor continuously claiming "well maybe none of the ones you've tested so far have failed, but I want you to test some more".)
I don't foresee this mattering? Any number >50 is just going to look like a circle anyway. But if you think this is relevant, I'll happily test higher numbers too. Feel free to suggest a procedure you'd be comfortable with. It's not my intention to change the market from its original, just to remove ambiguity in how it will resolve. (There were some concerns over the resolution of my previous market on pentagons, I don't want a repeat of that.)
@IsaacKing Thanks for working on formalizing the criteria, I do think that’s valuable to work out ahead of time.
While you’re right that above a certain limit the shapes should all look approximately identical, I think it would be valuable, if the test even goes that far, to test some larger numbers to ensure that the model properly generalizes.
I’d propose at least adding chiliagon, myriagon, and megagon (knowing these terms shouldn’t be part of the test, so if the model asks what they are, giving a definition is fine), as well as the 21-digit number mentioned below in the second comment thread on the market.
If all of these work correctly, I think NO bettors should get a couple of days to propose, say, at least two additional prompts to test, in case they discover some kind of out-of-distribution error that’s hard to predict up front (e.g., the model can’t do shapes where the number of sides looks like a year, or it messes up on 404, 538, or other numbers with strong meanings attached to them).
And to raise this ahead of time: what if the model refuses to draw certain shapes, such as a 420- or 666-sided polygon?
@JimHays Ok here's a simple solution: the NO bettors can submit any ten integers > 20 that still fit within the prompt length limit and I'll include them in the test. (Must be denoted in decimal.) Is that satisfactory?
@IsaacKing okay but obviously it can't generate a 10^3-gon or whatever, since there's a finite number of pixels in an image? I think a reasonable assumption was that you meant the simple polygons that children know the names of, like "pentagon", "hexagon", and "octagon", and not literally infinite numbers of shapes
@bens This is already covered below: if the sides cannot be distinguished at the image generator’s maximum resolution, the shape should still look correct when embedded in a pixel grid of that size. E.g., a regular 1,000-gon will look like a circle at most resolutions.
@JimHays hmm, okay, makes sense, I guess, although I have no clue how one would attempt to judge that
@bens The goal is to test whether it actually "understands", in a meaningful way, what a regular polygon is. If a human artist could draw 8 sides but upon being asked to draw 9 started giving me totally random geometric shapes instead, I would have some concerns about their mental capabilities.
@IsaacKing this seems like a good criterion to me, except there's a potentially big difference between random and adversarial selection, and it's not clear to me which is intended.
I.e. is the question supposed to resolve YES or NO on a case like "it works on all small n-gons and 95% of large n but breaks on 5% of large n"
@IsaacKing I don't want to claim to speak for all NO holders, but in the interest of balancing a practical testing limit against the broad criterion of "any regular polygon", allowing some kind of adversarial selection of a subset of larger shapes makes sense to me. I would prefer that these not need to be pre-specified right now though, as
- knowing the output image resolution of the models being tested would be an important factor in determining the most informative tests to run, and that value could increase over the course of the year, and
- giving market participants more time to run their own tests could help them determine whether the models have any particular blind spots that should be included in the test.