Resolves per my judgment. I currently write the Manifold weekly newsletter, so I think I'll be a good judge of whether an AI can do this.
Basically, it will need to identify some sort of coherent-ish narrative for the week's markets, write a blog-post-length article incorporating roughly a dozen Manifold market screenshots, and be at least passably engaging and meaningful to read, not just AI slop.
I won't require it to format the post in Substack or add hyperlinks to the text or whatever, but I will expect it to identify which topics are actually linked, explain the markets to some degree, and generate the connective text. If it can't handle screenshots but can insert placeholder images or links to each of the markets, that would probably be okay too, depending on how it does it.
This would slot in at a task length of perhaps 4 hours on the METR benchmark, in my opinion. AI systems are forecast to hit 4 hours well before the end of 2026, so this is a good gauge of how BS that benchmark is, from my perspective.
I will not be picky about whether it can do it in one shot or with, say, a couple of clarifying questions (as Deep Research now seems wont to ask), and I will probably test it with a couple of prompts, so it's not important whether it can do this 100% of the time or just, say, 1 in 3 times.
I expect that there's a decent chance that there's gray area in the resolution of this market. Given this:
1) I will not bet on this market.
2) I will resolve in the spirit of the market. (Basically, can AI do this thing that I can do in about 3-5 hours: amalgamate a week's worth of current events and Manifold markets into a newsletter.)
3) If there is a genuinely unusual gray area that threads the needle between a YES and a NO resolution, I will resolve to a PROB such as 50%, but I will attempt to resolve YES or NO if at all possible.
> This would slot in at a task length of perhaps 4 hours on the METR benchmark, in my opinion. AI systems are forecast to hit 4 hours well before the end of 2026, so this is a good gauge of how BS that benchmark is, from my perspective.
They don't claim it's the time horizon of AI on all tasks, though, so I don't see how this makes sense as criticism of METR themselves. People do infer things about other domains from this benchmark's results, and it's not silly to attempt that; it's just silly to assume it will hold 1-to-1 in other domains.
@bens What about this as a way to resolve the market? In 2026 you WILL do one AI-written newsletter, with minimal formatting. If people can tell that it's AI slop, the market resolves NO. If they can't, it resolves YES. This is less ambiguous and maybe builds (some) interest in the newsletter.
@bens I think this could be tricky to operationalize, and I'd probably want to make it a separate market, lol. Also, given that the current capabilities of AI fall a good deal short, I don't want to subject the audience of Above the Fold to reputation-degrading slop.
fwiw I think it's likely that frontier models will retain some identifiable AI-isms that make it decently easy to tell that something is written by AI, even when the content is totally passable and engaging. If someone sees AI-isms and immediately has a strong negative reaction, they may view the generated newsletter negatively, for example. But imo this market makes more sense if it tries to exclude style-level AI-isms and focuses on content / overall narrative. Basically: if someone from 2015 read it and didn't know about AI-isms, would they be engaged? But that's just my opinion, lmk what you think ben
@bens I think it would be fun to read. And as someone who doesn't always read the Above the Fold column, it'd give me extra incentive to find the fake. Also, AI is getting better at imitating style, and the prompt could include rules about em dashes or what have you. Though to be honest, as someone who uses em dashes, I think that tell is becoming an overrated indicator.