An AI agent is something which takes a task and directly applies changes to a code base. (Possibly via a merge request, letting a human review the changes.) I.e., it works similarly to giving a task to a programmer.
The market resolves to "YES" if such agents exist by the end of the year and are used in commercial environments, essentially displacing work of programmers.
The agent must work for a mainstream programming language and a commonly used code base format. An "AI app generator" which produces something from a template does not count, nor do specialized "no code" environments.
Tools like Copilot do not count - they are designed to help a programmer to write code, not to replace a programmer.
Experiments in lab settings do not count - it's much easier to operate in a controlled environment.
Hi everyone!
As you may know, on markets like this where the creator is inactive, a panel of three moderators is convened to decide on the resolution. Our job is to do our best to come up with what the creator would have resolved this market as. This job is a bit tricky, since we have to analyze everything that the creator has said about the market, in the title, description, and comments, and there were a lot of things said about this market, many of them conflicting or ambiguous!
There are some AI tools that would clearly make this market resolve YES if they worked and were used commercially, because they are able to work off of something like a ticket and produce a merge request on a repository. For example, DevGPT, Sweep, and Codegen. However:
DevGPT has no public evidence that it has been used successfully, and there is some evidence on the internet that it doesn't work or is a scam
Sweep saw some use on public GitHub repositories, but all of the commercial companies that tried it appear to have stopped using it after a very brief period in which it didn't work successfully
Codegen also has no public evidence that it has been used, other than tweets from the CEO
There are also some tools that definitely work but are questionable about whether they count. The biggest one in this space seems to be Aider. I personally tried out Aider, and it's certainly a very useful tool. However, Alex Mizrahi stated that "Tools like Copilot do not count - they are designed to help a programmer to write code, not to replace a programmer." So the question is, is Aider a tool like Copilot, or is it something that replaces a programmer? This is a question that ultimately winds up being subjective to some extent. It doesn't operate entirely independently in the way that the other three tools did, but it also goes much further than Copilot, since it is able to find a file in a repository, make changes to it, and then commit those changes.
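One way to make the assistant-versus-agent distinction concrete is a toy capability checklist. This sketch is purely illustrative: the flags, thresholds, and tool profiles below are our own assumptions for the sake of argument, not any official taxonomy or an accurate feature list for these products.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    suggests_code: bool      # inline completions while a human types (Copilot-style)
    edits_files: bool        # locates and modifies files in a repository
    commits_changes: bool    # creates commits or merge requests on its own
    works_from_ticket: bool  # runs end-to-end from a task/ticket description

def autonomy_level(tool: Tool) -> str:
    """Rough bucket for where a tool sits on the assistant-to-agent spectrum."""
    if tool.works_from_ticket and tool.commits_changes:
        return "agent"
    if tool.edits_files or tool.commits_changes:
        return "semi-autonomous"
    return "assistant"

# Hypothetical profiles matching the discussion above:
copilot_like = Tool("Copilot-like", True, False, False, False)
aider_like = Tool("Aider-like", True, True, True, False)
sweep_like = Tool("Sweep-like", False, True, True, True)
```

Under this (again, assumed) rubric, Aider lands in a middle bucket: more than a Copilot-style assistant, but not a fully ticket-driven agent, which is exactly why the question is subjective.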
After lengthy back and forth and reading everything written here, two of the three moderators voted to resolve the market as N/A and the third voted to resolve it as NO. Since the vote required unanimous agreement, we had to resort to the fallback option of resolving the market N/A.
If Alex returns to Manifold, we encourage him to re-resolve the market in the direction that he thinks is correct. If he would like to do so, he can contact any moderator and have them re-resolve the market per his request.
While I disagree about the outcome, I am satisfied and impressed with the effort and process the mods went through to reach it.
Thanks @Gabrielle!
@Putcallparity most of the discussion is done on discord. You can go there to see why YES is also a valid result.
I have no stake here but I wanted to offer an opinion. This market should definitely be N/Aed. It's a subjective market, with an absent creator and a lack of consensus. It's a classic case for ambiguous markets that require N/A.
It's pretty easy to come up with interpretations of "used commercially" and "AI agents" that either have been fulfilled or haven't. Plenty of digital ink has been spilled in the comments below on which interpretation is most correct, but there is still no consensus. I haven't seen many more obvious cases for N/A.
@Shump It may end up like This Market, being put up to a random mod panel. Not 100% sure, but seems likely.
@AlexMizrahi
Where are we at with this?
I messaged you 11 days ago and you seem to be AFK.
Please update us.
@SirCryptomind I think Alex has been clear that resolution might take a while, e.g.,
But given negative evidence above I think we can resolve this before the end of February, no need to wait till July.
@Jacy No, that's incomplete. The resolution is YES; Alex presumably hasn't found the time to evaluate it yet.
@firstuserhere it looks like Alex's most recent comment said:
... I'm leaning towards a NO resolution ...
so assuming a YES resolution seems a little odd. In any case, I think we agree that—while updates would be nice—we shouldn't expect Alex to resolve the market in the immediate future.
@Jacy Yes, that's a wrong initial judgement he arrived at. The market asked for X, Y, and Z, all 3 of which have been fulfilled. Some other things like W or K or Y are irrelevant. I'm sure if you're arguing in good faith, you can see that too.
@firstuserhere as of Dec 31, ~597 people disagreed with you and ~427 agreed. Notably, that count was a much larger majority, 46 to 14, in terms of bettors of at least M1000. If some new evidence has come out since Dec 31 that would change all their minds (or at least those dealing in good faith), presumably you would see comments to that effect.
I'm not saying that you're unambiguously wrong, just that the case for YES doesn't seem as clear as you make it out to be.
@Jacy that's not true. It's not even relevant. This isn't a poll. And most users there are inactive. The top NO holder is a Ukrainian mana-scammer/farmer running 30+ accounts. Come on.
In fact, even when unambiguous criteria have been met, people hold NO exactly because of stuff like this. Nonsense about the creator's judgement instead of reality.
I could make the case that the market closing at 60% means it should be more likely YES than NO, but that's just silly.
@firstuserhere the tildes were meant to express the possibility that numbers could be higher or lower due to things like inactivity. The 60% YES close is relevant, yes, and I was sharing the user counts as additional evidence because market price on Manifold often belies the influence of "whales," e.g., the two NO and one YES positions above M100,000 in this market, which I think most people agree have more influence on estimation than they should.
I probably agree with you about the general trend on Manifold of people holding positions irrationally and that an unfortunately large amount of Manifold trading is predicting the idiosyncrasies of individual creator judgment instead of reality. I'd like to see more practices and policies to combat that.
@Jacy 1. People hold the opposite position because of situations exactly like this, and that's very bad and frustrating.
Polls have their place, but that's not relevant here. Here's an example:
If it came down to subjectivity, would you be happy judging the resolution of the market by how many people hold YES or NO? Because there are roughly double the YES holders compared to NO holders. Truth is what markets should resolve to, not consensus.
Many of the top NO accounts are held by a single user, who had 30+ accounts banned earlier for mana-farming and donating it away.
@firstuserhere I didn't claim that markets should be resolved by user counts. I'm just presenting a counterargument to your intimation that the resolution of this market is so obvious that someone would need to be arguing in bad faith to disagree with you.
I'd agree that concerns about the top NO holder being that sort of mana-farmer are an undercutter of this counterargument, but I don't think they completely defeat it because I recognize many of the top NO accounts as being real users, and in general, I think the trends in user counts and comments would hold even if we excluded that user's influence.
>The top NO holder is a Ukrainian mana-scammer/farmer running 30+ accounts.
I have only one account. Also, I am Russian.
Yes, the conditions of the market were fulfilled. AI agents were used to create software commercially by the end of 2023.
Other considerations do not enter the picture. It does not matter what the performance on SWE-bench is; the agents were nonetheless used commercially, and by many companies.
@firstuserhere As Alex said below for SWE-bench, performance of the tool is part of the definition of "commercial use" here, particularly the explicit criterion of "essentially displacing work of programmers."
I think this will necessarily be a subjective resolution; I could see it going either way, depending on where Alex's bar is set for things like this. Personally, I think they gave pretty clear evidence when clarifying in the comments that they had a high bar compared to what's been available so far, which is why I bet NO. For example:
When AI agents are here and they are cheaper to use than human programmers and provide sufficient quality of work, there will be a huge incentive to use them. So it's very unlikely that there will be only one example of use of an agent.
@Jacy yes, DISPLACING work of software engineers (such as PRs, as Alex says too) does not mean REPLACING all software engineering work across all levels. What a junior developer does is very different from what a senior developer does. The lower bar has been met; the upper bar seems to be years away. But the market is for the lower bar.
@firstuserhere This displace/replace distinction doesn’t follow as decisive to me considering the description includes this clause:
Indeed, I think full replacement of jobs is indicated by both the comments AND description. Also there were the comments about code standards (google level) and having many users follow suit (10). I bet No based on careful reading of the maker’s clarifying comments.
@firstuserhere If they are used to that end. He didn’t contradict anything from before. Replacement is an outcome, not a method.
@Panfilo And now we do have proof that they're used to that end. Alex said Sweep and DevGPT qualify if what they advertise their use to be is indeed true and they are being used commercially. We've got confirmation of that from at least 2 companies.
@firstuserhere I believe you. I’m overexposed too, but the median and average bettor both agree with me.
@Panfilo I think both sides can agree that the market is not as valuable as it would've been with a clear resolution condition and that @AlexMizrahi should hop in here and clarify stuff more than once every 3 months
@Panfilo I do not agree with you. I believe this market should have resolved months ago as YES.
With the understanding that knowledge and technology are not evenly distributed, arriving at some successful commercial use guarantees that there will be increasing successful commercial use. Even if technological development of LLMs were frozen, there would still be minor external innovations bringing that technology to larger markets. This is the low bar that @firstuserhere claims as the threshold for this market, and I have to agree. Showing that it is possible a handful of times, by a small number of people, outside of a lab setting is exactly the stated criterion of this market.
I believe too much weight has been placed on the statement "tools like Copilot do not count" without considering the qualifying reason that was given for eliminating Copilot: "they are designed to help a programmer to write code, not to replace a programmer". Copilot itself has had ambitions beyond programming assistant or auto-complete tool since it was launched and has gone through a number of feature updates. At what point does Copilot implement a new feature that makes it more than a programming assistant? The inability to answer that question clearly while the market was open makes Copilot a poor point of comparison for what is or isn't a qualifying AI agent. For this reason, I believe any argument that takes the form of "that's like Copilot but with x" should be discounted.
Software agents have clearly eliminated a whole class of lower-tier "Junior" developers and have been shown to operate at a fairly high level semi-autonomously. On average, Junior developers, with their limited experience, require significantly more oversight for each line of code written than any currently available LLM agents, and work significantly slower. I'd go further and suggest that very few, if any, Junior developers are being hired in the current job market. I believe the apparent elimination of entry-level programming positions satisfies the stated resolution criterion of "replace a programmer".
@becauseyoudo Bang on. This is the debate over whether AIs are "just tools" even when they're replacing work that was previously human labor.
I'm going to quote Richard Ngo from OpenAI:
The next version will be “LLMs are just tools, and lack any intentions or goals”, which we’ll continue hearing until well after it’s clearly false.
@firstuserhere even as "just tools", I think LLMs have already disrupted the process of software development enough to satisfy every clearly stated resolution criteria in this question. Those points that were not clearly or explicitly stated like: "Will AI agents replace all human software developers?" are outside the scope of this question and should be posed as separate questions.
As Alex said below for SWE-bench, performance of the tool is part of the definition of "commercial use" here, particularly the explicit criterion of "essentially displacing work of programmers."
I genuinely do not see what SWE-bench has to do with this. One, the SWE-bench results are not using any of the proposed tools. This is the most important part! Two, SWE-bench has (as far as I can tell) not seen any work on it other than baselines published by the paper that first introduced the dataset. Three, even given all that,
If claim is made in 2023 (e.g. DevGPT and Sweep.dev might qualify), we can wait until more evidence arrives. (E.g. if 1 startup claims they made code using DevGPT, and then 10 more make similar claims in 2024, we consider that supporting the original claim.) If it doesn't, we assume that the signal was fake.
The claim has unambiguously been made that large enterprises are using Codegen to merge PRs in million-line codebases, as far as I can tell. So we can wait for 10 more similar claims in 2024.
@firstuserhere This displace/replace distinction doesn’t follow as decisive to me considering the description includes this clause:
Right, and Copilot does not generate whole PRs like codegen does.
If they are used to that end. He didn’t contradict anything from before. Replacement is an outcome, not a method.
I mean, if a company is using Codegen to generate PRs that would otherwise be written by programmers (which was claimed below), that is replacing the work of programmers. Alex repeatedly comments that narrow tools aren't allowed - the tools proposed below are general; they take natural language prompts and apply them to codebases.