
I'm thinking of something like https://mentat.ai/, but that actually works.
I will provide a paragraph or so describing the change I want made. Then it should create a GitHub PR, which I will review and leave only a few comments before merging. The whole process should take less than 30 minutes. This should work fairly reliably.
I tried this yesterday and it failed haha:
https://github.com/manifoldmarkets/manifold/pull/2694
See more discussion in my post:
🏅 Top traders
# | Total profit |
---|---|
1 | Ṁ29,803 |
2 | Ṁ6,448 |
3 | Ṁ5,174 |
4 | Ṁ4,766 |
5 | Ṁ2,413 |
Check out my new market with some forecasts about coding agents! https://manifold.markets/JamesGrugett/ai-coding-agent-forecasts-from-my-b?r=SmFtZXNHcnVnZXR0
Alright!! The market has closed and it's time for resolution!
I fired up Claude Code (as the best non-Codebuff agent) and asked it to implement a small feature for me: building a new agent template in our multi-agent system.
✅ It did it!
I think this resolution hinges on the definition of "small feature". Coding agents today are up for this task and are consistently successful with an iteration or two of human feedback.
However, larger tasks, ones that require significant design, or "g"-loaded tasks which require a lot of working memory, are usually not completed satisfactorily.
Frequently, even small features will be implemented more verbosely than an expert human would write them: duplicating code, missing a commonly used pattern in your codebase, etc.
Coding agents are not perfect, but they've made incredible progress in the last year. We really are on a breakout trajectory for AI — the world is changing quickly!
I will need to think carefully so as not to underestimate a milestone for next year that I can turn into a prediction market.
Resolved: YES
@JamesGrugett I'm surprised at this resolution, as the described feature sounds more like a small code snippet. It's not a fully-fledged change to the functionality of the deployed code, is it? In the terminology of your own post, it sounds like level 2 automation and not level 3.
Of course, I might be confused as to what exactly an "agent template" is, in your context. No problem! Since the AI is regularly coding such small features for you, it should be easy to give 2-3 better examples.
@VitorBosshard I'm sympathetic to this take, and agree it isn't obviously YES.
But I do think you can ask for whole features, which make changes across e.g. the backend, front end, util files, and it will give you something that works.
@ian posted an example below where it modified 9 files across different parts of the codebase to make a feature for deleting spam comments, on the order of ~200 lines changed.
I think this reliably works for an app like Manifold, which is just a web app with a server and a postgres database.
For this question, I had to choose either YES or NO. I think you can make the case for either, but IMO YES is the better answer.
@JamesGrugett Ok, looking through the other comments, there are indeed multiple examples of fully fledged features.
Here's another good one-shot pr from cursor's background agent, adding the ability for admins/mods to 'delete' spam comments so that they aren't rendered at all, unlike the 'hide' feature which still renders the hidden comments: https://github.com/manifoldmarkets/manifold/pull/3600
This took a minute to prompt, 5m for cursor to come up with a solution, and 5-10m to test to make sure it worked.
This was a really good experience! I used cursor's background agent to add a minimum bet filter to the trades tab and it had a good first version in 5 minutes. Then I tested it and prompted it to get rid of pagination and use infinite scroll instead. Done in less than 20 minutes! https://github.com/manifoldmarkets/manifold/pull/3599
This looks good to me, stephen gave it two prompts to create this and I think it took less than 10 mins https://github.com/manifoldmarkets/manifold/pull/3588
@ian Looks like we need another prompt to fix the type error, should come in well under 30 mins still, though
@ian do you have access to chatgpt plus or pro, and would you be willing to see how codex-1 fares? it's currently only accessible on pro and teams iirc but will probably be accessible to plus before the market closes
GPT 4.1 is awesome for coding.
It's genuinely really good. (mini is ok, nano is dogwater). I have been using it off azure with cursor both as an assistant and as a tedious-implementation speedrunner. It's one-shotted so many instructions that 4o would have a bad time with, and that claude would overthink.
Not tab complete, mostly just asking stuff. Really has come a long way with code
I'd like to conduct some tests using codebuff/cursor. What are acceptable small features in your mind? I have a couple ideas:
- add a button to the comments bottom row that allows users to tip the commenter. Denormalize the tip amount onto the comment and display the total tipped amount on the button.
- Add a delete button for admins/mods that marks a comment as deleted (don't actually delete the comment, just set the deleted flag and hidden flags both) that hides the comment completely from the market.
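To make the two ideas above concrete, here is a rough sketch of the data-level changes they imply. Everything in it is an assumption for illustration: the `Comment` and `User` shapes, the `tipTotal`, `deleted`, and `hidden` field names, and the helper functions are hypothetical and not taken from Manifold's actual schema or code.

```typescript
// Hypothetical shapes for illustration only; not Manifold's real types.
type Comment = {
  id: string
  text: string
  hidden: boolean
  deleted: boolean
  tipTotal: number // denormalized running total of tips on this comment
}

type User = { id: string; isAdminOrMod: boolean }

// Soft delete: the comment is kept, but both the deleted and hidden flags are set.
function softDeleteComment(comment: Comment, actor: User): Comment {
  if (!actor.isAdminOrMod) {
    throw new Error('Only admins or mods can delete comments')
  }
  return { ...comment, deleted: true, hidden: true }
}

// Deleted comments are not rendered at all; merely hidden ones stay in the list.
function renderableComments(comments: Comment[]): Comment[] {
  return comments.filter((c) => !c.deleted)
}

// Tipping: add the amount to the denormalized total shown on the button.
function addTip(comment: Comment, amount: number): Comment {
  if (amount <= 0) {
    throw new Error('Tip amount must be positive')
  }
  return { ...comment, tipTotal: comment.tipTotal + amount }
}
```

A real version would also touch the API endpoint, the database write, and the UI button, which is what makes these reasonable test cases for an agent.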
@JamesGrugett said the delete comment button for spam fit the bill, I'll try using codebuff to do this soon
@ian I am aware that you work on Manifold, but since you are also the largest YES holder, can we maybe agree to let @JamesGrugett do these kinds of evaluations once the time comes?
@CalibratedNeutral That sounds reasonable, although he doesn't work at manifold anymore so I'm not sure if he'll want to put 30 mins in to do this. I was going to film my attempt from scratch
@CalibratedNeutral I was not aware of that. Then maybe a third party (another developer working on Manifold)? The stakes are reasonably high for me, so I really would strongly prefer to have everything as unbiased as possible.
@CalibratedNeutral Alternatively, @JamesGrugett could test this question on his new startup, codebuff. He uses codebuff to help develop codebuff
@ian Either option sounds good to me as long as the resolution criteria are followed according to @JamesGrugett's judgement